【Question answering of the past and present】

1. A pre-existence

The traditional question answering (QA) system is an application of Artificial Intelligence (AI). It is usually confined to a very narrow and specialized domain and basically consists of a hand-crafted knowledge base with a natural language interface. Because the domain is narrow, the vocabulary is very limited and pragmatic ambiguity can be kept effectively under control. Questions are highly predictable, close to a closed set, so the rules for the corresponding answers are fairly straightforward. Well-known early projects from the 1960s and 1970s include LUNAR, a QA system specializing in answering questions about the geological analysis of the lunar samples collected by the Apollo Moon landings. SHRDLU is another famous QA expert system in AI history; it simulates the operation of a robot in a toy blocks world. The robot can answer questions about the geometric state of the toys and follow natural language instructions for its operations.

These early AI explorations seemed promising, revealing a fairy-tale world of scientific fantasy and greatly stimulating our curiosity and imagination. Nevertheless, in essence, these were just toy systems confined to the laboratory and of little practical value. As the field of artificial intelligence grew narrower and narrower (although some expert systems reached a practical level, the majority of AI work based on common sense and knowledge reasoning could not get beyond the lab), the corresponding QA systems failed to render meaningful results. Some conversational systems (chatterbots) were developed along the way and became popular online toys for children. (I remember that when my daughter was young, she was very fond of surfing the Internet to find various chatbots, sometimes deliberately asking tricky questions for fun. Recent years have seen a revival of this tradition by industrial giants, with some flavor seen in Siri and much greater emphasis in Microsoft's Xiaoice, "Little Ice".)

2. Rebirth

Industrial open-domain QA systems are another story: they came into existence with the Internet boom and the popularity of search engines. Specifically, open-domain QA was born in 1999, when TREC-8 (the Eighth Text REtrieval Conference) decided to add a natural language QA track to its competition. The track was funded by the US Department of Defense's DARPA program and administered by the United States National Institute of Standards and Technology (NIST), thus giving birth to this emerging QA community. The opening remarks in its call for participation were very impressive, to this effect:

Users have questions, they need answers. Search engines claim that they are doing information retrieval, yet the information is not an answer to their questions but links to thousands of possibly related files. Answers may or may not be in the returned documents. In any case, people are compelled to read the documents in order to find answers. A QA system in our vision is to solve this key problem of information need. For QA, the input is a natural language question, the output is the answer, it is that simple.

It is worth introducing some background on the state of academia and industry at the time open-domain QA was born.

From the academic point of view, artificial intelligence in the traditional sense was no longer popular, having been replaced by large-scale corpus-based machine learning and statistical research. Linguistic rules still played a role in the field of natural language, but only as a complement to mainstream machine learning. The so-called intelligent knowledge systems based purely on knowledge or common sense reasoning were largely put on hold by academic scholars (except for a few, such as Dr. Douglas Lenat with his Cyc). In the academic community before the birth of open-domain question answering, there was a very important development, namely the birth and popularity of a new area called Information Extraction (IE), again a child of DARPA. Traditional natural language understanding (NLU) faces the entire ocean of language, trying to analyze each sentence in search of a complete semantic representation of all its parts. IE is different: it is task-driven, aiming only at the defined target information and leaving the rest aside. For example, an IE template for a conference may be defined to fill in the conference's [name], [time], [location], [sponsors], [registration] and the like. It is very similar to filling in the blanks in a student's reading comprehension test. The idea of task-driven semantics for IE shortens the distance between language technology and practicality, allowing researchers to focus on optimizing systems for the defined tasks rather than trying to swallow the language monster in one bite. By 1999, the IE community had held seven rounds of competition (up to MUC-7, the Seventh Message Understanding Conference), and the tasks of this area, its approaches and its limitations at the time were all relatively clear. The most mature part of information extraction technology is so-called Named Entity (NE) tagging, including the identification of names of people, locations and organizations as well as the tagging of time, percentage, etc. The state-of-the-art systems, whether using machine learning or hand-crafted rules, reached a combined precision-recall score (F-measure) of 90+%, close to the quality of human performance. This first-of-its-kind technological advancement in a young field turned out to play a key role in the new generation of open-domain QA.
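For readers unfamiliar with the F-measure mentioned above, here is a minimal sketch of how precision and recall are combined into one score. The function name and the counts are my own illustrative assumptions, not figures from any MUC evaluation.

```python
def f_measure(true_positives, false_positives, false_negatives, beta=1.0):
    """Return (precision, recall, F-score) for one tagging run.

    beta=1.0 gives the balanced F1 score, the harmonic mean of
    precision and recall.
    """
    predicted = true_positives + false_positives   # entities the system tagged
    actual = true_positives + false_negatives      # entities in the answer key
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_score = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_score


# Hypothetical counts: 90 correct tags, 10 spurious tags, 10 missed entities.
precision, recall, f1 = f_measure(true_positives=90, false_positives=10, false_negatives=10)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

With these made-up counts, precision and recall are both 0.9, so the combined score sits at about 0.9 as well, which is the "90+%" range discussed above.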

In industry, by 1999, search engines had grown rapidly with the popularity of the Internet, and search algorithms based on keyword matching and page ranking were quite mature. Barring a methodological revolution, keyword search seemed to have almost reached its limit. There was an increasing call for going beyond basic keyword search. Users were dissatisfied with search results in the form of links; they needed more granular results, at least paragraphs (snippets) instead of URLs, and preferably direct short answers to the questions they had in mind. Although the direct answer was a dream yet to come true, awaiting the open-domain QA era, full-text search increasingly adopted paragraph retrieval instead of simple document URLs as common industry practice, and search results changed from simple links to web pages into snippets with the keywords highlighted.

In such a favorable environment in industry and academia, open-domain question answering came onto the stage of history. NIST organized its first competition, requiring participating QA systems to provide the exact answer to each question, with a short answer of no more than 50 bytes in length and a long answer of no more than 250 bytes. Here are sample questions from the first QA track:

Who was the first American in space?
Where is the Taj Mahal?
In what year did Joe DiMaggio compile his 56-game hitting streak?

3. Short-lived prosperity

What are the results and significance of this first open-domain QA competition? It should be said that the results were impressive, a milestone in QA history. The best systems (including ours) achieved a correct rate of more than 60%; that is, for roughly every three questions, the system could search the given corpus and return two correct answers. This was a very encouraging result for a first attempt at an open-domain system. In the heyday of the dot-com boom, the IT industry was eager to turn this latest research into information products and revolutionize search. There were a lot of interesting stories after that (see my related blog post in Chinese: "the road to entrepreneurship"), eventually leading to the historic AI event of IBM Watson beating humans at Jeopardy.

The timing, and everything prepared by then by the organizers, the search industry and academia, all contributed to the QA systems' seemingly miraculous results. NIST emphasized well-formed natural language questions as the appropriate input (i.e. English questions, see above), rather than traditional short keyword queries. These questions tend to be long, which makes them well suited to paragraph search. For the competition's sake, the organizers ensured that each question asked indeed had an answer in the given corpus. As a result, the text archive contained statements similar to the designed questions, which increased the odds of sentence matching in paragraph retrieval (Watson's later practice shows that, from the big data perspective, similar statements containing answers are bound to appear in text as long as a question is naturally long). Imagine instead a query of only one or two keywords: it would be extremely difficult to identify the relevant paragraphs and statements that contain answers. Of course, finding the relevant paragraphs or statements is not sufficient for this task, but it effectively narrows the scope of the search, creating good conditions for pinpointing the short answers required. At this point, the relatively mature technology of named entity tagging from the information extraction community kicked in. To achieve objectivity and consistency in administering the QA competition, the organizers deliberately selected only questions that were relatively simple and straightforward: questions about names, times or locations (so-called factoid questions). This practice naturally matches the named entity task closely, making the first step into open-domain QA a smooth process and returning very encouraging results as well as a shining prospect to the world.
For example, for the question "In what year did Joe DiMaggio compile his 56-game hitting streak?", paragraph or sentence search could easily find text statements similar to the following: "Joe DiMaggio's 56 game hitting streak was between May 15, 1941 and July 16". An NE system tags 1941 as a time expression with no problem, and the asking point for time is not difficult to decode when parsing the wh-phrase "in what year". Therefore, an exact answer to the exact question seems magically retrieved from the sea of documents to satisfy the user, like a needle found in a haystack. Following roughly the same approach, equipped with gigantic computing power for parallel processing of big data, IBM Watson beat humans 11 years later on the live Jeopardy show in front of a nationwide TV audience, filling the entire nation with awe at this technological advance. From the QA research perspective, IBM's victory on the show was in fact an expected, natural outcome, more an engineering scale-up showcase than a research breakthrough, as the basic approach of snippet + NE + asking-point had long been proven.
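To make the snippet + NE + asking-point recipe concrete, here is a toy sketch in Python. It is only an illustration under heavy assumptions: the three-sentence corpus, the stop word list, the keyword-overlap retrieval and the regex "NE tagger" for years are all stand-ins I made up, not the components of any competition system.

```python
import re

# Hypothetical mini corpus; the first sentence is the one quoted above.
CORPUS = [
    "Joe DiMaggio's 56 game hitting streak was between May 15, 1941 and July 16",
    "The Taj Mahal is in Agra, India",
    "Alan Shepard was the first American in space",
]

# Assumed stop words, ignored when matching a question against sentences.
STOP_WORDS = {"in", "what", "year", "did", "his", "the", "a", "is", "was",
              "who", "where", "when"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def asking_point(question):
    """Decode the expected answer type from the wh-phrase (crude heuristic)."""
    q = question.lower()
    if q.startswith("in what year") or q.startswith("when"):
        return "YEAR"
    if q.startswith("who"):
        return "PERSON"
    if q.startswith("where"):
        return "LOCATION"
    return "UNKNOWN"


def tag_years(sentence):
    """Stand-in NE tagger: find four-digit years from 1000 to 2099."""
    return re.findall(r"\b(?:1[0-9]{3}|20[0-9]{2})\b", sentence)


def best_snippet(question, sentences):
    """Snippet retrieval: pick the sentence sharing the most content words."""
    q_terms = set(tokenize(question)) - STOP_WORDS
    return max(sentences, key=lambda s: len(q_terms & set(tokenize(s))))


def answer(question, sentences):
    """Return an entity of the asked-for type from the best snippet, if any."""
    snippet = best_snippet(question, sentences)
    if asking_point(question) == "YEAR":
        years = tag_years(snippet)
        return years[0] if years else None
    return None  # other asking points are not implemented in this sketch


QUESTION = "In what year did Joe DiMaggio compile his 56-game hitting streak?"
print(answer(QUESTION, CORPUS))  # prints 1941
```

The long, natural question is what makes the keyword overlap decisive here; with a one-word query such as "DiMaggio", the retrieval step would have far less to work with.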

A retrospect shows that adequate QA systems for factoid questions invariably combine a solid Named Entity module with a question parser for identifying asking points. As long as there is IE-indexed big data behind it, with information redundancy as its nature, factoid QA is a very tractable task.

4. State of the art

The year 1999 witnessed the academic community's initial success in the first open-domain QA track, a new frontier of the retrieval world. We also benefited from that event as a winner, soon securing a venture capital injection of $10 million from Wall Street. It was an exciting time, shortly after AskJeeves' initial success in presenting a natural language interface online (but they did not have the QA technology for handling a huge archive and retrieving exact answers automatically; instead, they used human editors behind the scenes to update the answer database). A number of QA start-ups were funded. We were all expecting to create a new era in the information revolution. Unfortunately, the good times did not last: the Internet bubble soon burst, and the IT industry fell into the abyss of depression. Investors tightened their purse strings, and the QA heat soon dropped to freezing point and almost disappeared from the industry (except for giants' labs such as IBM Watson; in our case, we shifted from QA to mining online brand intelligence for enterprise clients). No one in the mainstream believed in this technology anymore. Compared with traditional keyword indexing and searching, open-domain QA was not as robust and had yet to scale up to really big data to show its power. The focus of the search industry shifted from depth back to breadth, concentrating on indexing coverage, including the so-called deep web. While the development of QA systems became almost extinct in industry, this emerging field stayed deeply rooted in the academic community and developed into an important branch, with increasing natural language research from universities and research labs. IBM later solved the scale-up challenge, as a precursor of the current big data architectural breakthrough.

At the same time, scholars began to summarize the various types of questions that challenge QA. A common classification is based on identifying the type of a question by its asking point. Many of us still remember our high school language classes, where the teacher stressed the 6 WHs for reading comprehension: who / what / when / where / how / why. (Who did what when, where, how and why?) Once the answers to these questions are clear, the central stories of an article are in hand. As a simulation of human reading comprehension, the QA system is designed to answer these key WH questions as well. It is worth noting that these WH questions are of different difficulty levels, depending on the types of asking points (one major goal of question parsing is to identify the key need of a question, which we call asking point identification, usually based on parsing wh-phrases and other question clues). Questions whose asking points correspond to an entity as an appropriate answer, such as who / when / where, are relatively easy to answer (i.e. factoid questions). Other types of questions are not simply answerable by an entity, such as what-is / how / why, and there is a consensus that answering such questions is a much more challenging task than answering factoid questions. A brief introduction to these three types of "tough" questions and their solutions is presented below as a showcase of the ongoing state of the art, to conclude this overview of the QA journey.
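The split between factoid questions and the three "tough" types can be sketched as a small classifier over the wh-phrase. This is a deliberately crude heuristic of my own for illustration (for instance, it treats "What/who is X?" as a definition question only when X does not start with a determiner, and it does not handle quantity questions such as "how many"); real asking point identification relies on proper question parsing.

```python
def classify_question(question):
    """Map a question to 'factoid', 'definition', 'how', 'why' or 'unknown'."""
    words = question.lower().rstrip("?").split()
    if not words:
        return "unknown"
    first, rest = words[0], words[1:]
    if first in ("how", "why"):
        return first
    # "What/who is X?" with a bare name or term left over: definition question.
    if (first in ("what", "who") and len(rest) >= 2
            and rest[0] == "is" and rest[1] not in ("the", "a", "an")):
        return "definition"
    # Asking points that an entity (person, time, place) can answer.
    if first in ("who", "when", "where") or words[:3] == ["in", "what", "year"]:
        return "factoid"
    return "unknown"


for q in ["Who was the first American in space?",
          "Where is the Taj Mahal?",
          "Who is Bill Clinton?",
          "How can we increase bone density?",
          "Why do people like our product?"]:
    print(q, "->", classify_question(q))
```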

What/who is X? This type of question is the so-called definition question, such as What is iPad II? Who is Bill Clinton? Such a question is typically very short: after the wh-word and the stop word "is" are stripped in question parsing, what is left is just a name or a term as input to the QA system. Such an input is a poor fit for a traditional keyword retrieval system, as it ends up with too many hits, from which the system can only pick the documents with the highest keyword density or page rank as returns. From the QA perspective, however, the minimal requirement for answering this question is a definition statement of the form "X is a ...". Since any entity or object stands in multiple relationships with other entities and is involved in various events as described in the corpus, a better answer to the definition question involves a summary of the entity with links to all its key associated relations and events, giving a profile of the entity. Such technology exists and, in fact, has been partly deployed today. It is called the knowledge graph, supported by underlying information extraction and fusion. The state-of-the-art solution for this type of question is best illustrated by Google's deployment of its knowledge graph in handling short search queries for movie stars or other VIPs.
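The minimal requirement, a definition statement of the form "X is a ...", can be approximated with a surface pattern. The sketch below is only that: a regex over made-up sentences, nowhere near the entity profiles a knowledge graph provides. The function name and sample sentences are hypothetical.

```python
import re

def find_definitions(term, sentences):
    """Collect the '...' part of statements shaped like 'X is/was a/an/the ...'."""
    pattern = re.compile(
        r"\b" + re.escape(term) + r"\s+(?:is|was)\s+(?:an?|the)\s+([^.;]+)",
        re.IGNORECASE,
    )
    hits = []
    for sentence in sentences:
        match = pattern.search(sentence)
        if match:
            hits.append(match.group(1).strip())
    return hits


# Hypothetical sentences standing in for a text archive.
SENTENCES = [
    "Bill Clinton is the 42nd president of the United States.",
    "Bill Clinton visited Arkansas last week.",
    "Many voters said Bill Clinton was a gifted speaker.",
]

print(find_definitions("Bill Clinton", SENTENCES))
```

Only the first and third sentences match the pattern; the second mentions the name without defining it, which is exactly the kind of hit a pure keyword search cannot tell apart.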

The next challenge is how-questions, which ask about a solution for solving a problem or doing something, e.g. How can we increase bone density? How to treat a heart attack? This type of question calls for a summary of all types of solutions, such as medicines, experts, procedures or recipes. A simple phrase is usually not a good answer and is bound to miss the variety of possible solutions needed to satisfy the information need of the users (often product designers, scientists or patent lawyers), who are typically at the stage of prior art research and literature review for a solution they have in mind. We have developed such a powerful system, based on deep parsing and information extraction, to answer open-domain how-questions comprehensively in the product called Illumin8, as deployed by Elsevier for quite some years. (Powerful as it is, unfortunately, it did not end up as a commercial success in the market from the revenue perspective.)

The third difficult question is why. People ask why-questions to find the cause or motive of a phenomenon, whether an event or an opinion. For example, why do people like or dislike our product Xyz? There might be thousands of different reasons behind a sentiment or opinion. Some reasons are explicitly expressed (I love the new iPhone 7 because of its greatly enhanced camera), and more reasons are actually implicit (just replaced my iPhone, it sucks in battery life). An adequate QA system should be equipped with the ability to mine the corpus and to summarize and rank the key reasons for the user. In the last 5 years, we have developed a customer insight product that can answer the why-questions behind public opinions and sentiments on any topic by mining the entire social media space.
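As a rough illustration of mining and ranking reasons, the sketch below counts only explicitly marked reasons ("because of ...") in a handful of made-up posts. It is an assumption-laden toy, not how our product works: implicit reasons, like the battery-life complaint above, need deep parsing and sentiment analysis that a surface pattern cannot supply.

```python
import re
from collections import Counter

# Surface pattern for explicit reasons; an optional determiner is skipped.
REASON_PATTERN = re.compile(
    r"\bbecause of (?:its |the |their )?([^.,;!]+)", re.IGNORECASE
)


def rank_reasons(posts):
    """Count explicitly stated reasons and return them, most frequent first."""
    counts = Counter()
    for post in posts:
        for match in REASON_PATTERN.finditer(post):
            counts[match.group(1).strip().lower()] += 1
    return counts.most_common()


# Hypothetical social media posts.
POSTS = [
    "I love the new iPhone 7 because of its greatly enhanced camera.",
    "Bought it because of the battery life.",
    "Switched because of its greatly enhanced camera!",
    "just replaced my iPhone, it sucks in battery life",  # implicit, not caught
]

print(rank_reasons(POSTS))
```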

Since I came to Silicon Valley 9 years ago, I have been lucky, and proud, to have had the chance to design and develop QA systems for answering these widely acknowledged challenging questions. Two products, for answering open-domain how-questions and why-questions in addition to deep sentiment analysis, have been developed and deployed to global customers. Our deep parsing and IE platform is also equipped with the capability to construct a deep knowledge graph to help answer definition questions, but unlike Google with its huge platform for search needs, we have not yet identified a commercial opportunity to deploy that capability in a market.

This piece of writing first appeared in 2011 on my personal blog, with only limited revisions since. Thanks to Google Translate at https://translate.google.com/ for providing a quick basis, which I then post-edited.

 

[Related]

http://en.wikipedia.org/wiki/Question_answering

The Anti-Eliza Effect, New Concept in AI

"Knowledge map and open-domain QA (1)" (in Chinese)

"knowledge map and how-question QA (2)"  (in Chinese)

"Ask Jeeves and its million-dollar idea for human interface" (in Chinese)

Dr Li’s NLP Blog in English

 

Newest GNMT: time to witness the miracle of Google Translate

Wei:
Recently, the microblogging (WeChat) community has been full of hot discussion and testing of the newest announcement of the Google Translate breakthrough in its NMT (neural network-based machine translation) offering, which claims significant progress in translation quality and readability. It sounds like a major breakthrough worthy of attention and celebration.

The report says:

Ten years ago, we released Google Translate; the core algorithm behind this service is PBMT: Phrase-Based Machine Translation. Since then, the rapid development of machine intelligence has given us a great boost in speech recognition and image recognition, but improving machine translation is still a difficult task.

Today, we announced the release of the Google Neural Machine Translation (GNMT) system, which utilizes state-of-the-art training techniques to achieve the largest improvements in machine translation quality so far. For a full review of our findings, please see our paper "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation."

A few years ago, we began using RNNs (Recurrent Neural Networks) to directly learn the mapping of an input sequence (such as a sentence in one language) to an output sequence (the same sentence in another language). Phrase-based machine translation (PBMT) breaks the input sentence into words and phrases and then largely translates them independently, while NMT treats the entire input sentence as the basic unit of translation.

The advantage of this approach is that, compared to the previous phrase-based translation system, it requires less engineering design. When it was first proposed, the accuracy of NMT on a medium-sized public benchmark data set was comparable to that of a phrase-based translation system. Since then, researchers have proposed a number of techniques to improve NMT, including mimicking an external alignment model to handle rare words, using attention to align input and output words, and decomposing words into smaller units to cope with rare words. Despite these advances, the speed and accuracy of NMT had not been able to meet the requirements of a production system such as Google Translate. Our new paper describes how to overcome many of the challenges of making NMT work on very large data sets and how to build a system that is both fast and accurate enough to deliver a better translation experience for Google users and services.
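As an aside to the quoted report, "using attention to align input and output words" can be sketched in a few lines. What follows is a generic dot-product attention step in plain Python with made-up two-dimensional vectors; it is my own illustration of the idea, not the GNMT architecture or any code from the paper.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, encoder_states):
    """Weight each source-word state by its dot product with the decoder query.

    Returns the attention weights (a soft alignment over source words) and
    the context vector (the weighted sum of the source states).
    """
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * state[i] for w, state in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context


# Hypothetical encoder states for three source words, and one decoder query.
ENCODER_STATES = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
QUERY = [2.0, 0.0]

weights, context = attend(QUERY, ENCODER_STATES)
print([round(w, 3) for w in weights], [round(c, 3) for c in context])
```

The second source word scores lowest against this query, so it receives the smallest weight; in a trained system these weights are what lets each output word "look at" the relevant input words.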

............

Using side-by-side comparisons by human raters as the standard, the GNMT system translates significantly better than the previous phrase-based production system. With the help of bilingual human raters, we found on sample sentences from Wikipedia and news websites that GNMT reduced translation errors by 55% to 85% or more on several major language pairs.

In addition to publishing this research paper today, we have also announced that GNMT will be put into production for a very difficult language pair: Chinese to English.

Now, Chinese-to-English translations in the mobile and web versions of Google Translate are served 100% by GNMT, about 18 million translations per day. GNMT's production deployment uses our open machine learning toolkit TensorFlow and our Tensor Processing Units (TPUs), which provide sufficient computational power to deploy these powerful GNMT models while meeting Google Translate's strict latency requirements.

Chinese-to-English translation is one of the more than 10,000 language pairs supported by Google Translate. In the coming months, we will continue to extend our GNMT to far more language pairs.

Google Translate's GNMT achieves a major breakthrough!

As an old machine translation researcher, I cannot resist this temptation. I cannot wait to try this latest version of Google Translate for Chinese-English.
Previously I tried Google's Chinese-to-English online translation multiple times; the overall quality was not very readable and certainly not as good as that of its competitor Baidu. With this newest breakthrough using deep learning with neural networks, it is believed to be getting close to human translation quality. I have a few hundred Chinese blog posts on NLP waiting to be translated as a trial. I was looking forward to this first attempt at using Google Translate on my science popularization blog post titled Introduction to NLP Architecture. My adventure is about to start. Now is the time to witness the miracle, if the miracle does exist.

Dong:
I hope you will not be disappointed. I have jokingly said before: rule-based machine translation is a fool, statistical machine translation is a madman, and now I continue the ridicule: neural machine translation is a "liar" (I am not referring to the developers behind NMT). Language is not like a cat's face or the like: surface fluency alone does not work; the content should be faithful to the original!

Wei:
Let us experience the magic, please listen to this translated piece of my blog:

This is my Introduction to NLP Architecture fully automatically translated by Google Translate yesterday (10/2/2016) and fully automatically read out without any human interference.  I have to say, this is way beyond my initial expectation and belief.

Listen to it for yourself: the automatic speech generation of this science blog of mine is amazingly clear and understandable. If you are an NLP student, you can take it as a lecture note from a seasoned NLP practitioner (definitely clearer than if I were giving this lecture myself, with my strong accent). The original blog was in Chinese, and I used the newest Google Translate, which is claimed to be based on deep learning using sentence-based translation as well as character-based techniques.

Prof. Dong, you know my background and my originally doubtful mindset. However, in the face of such progress, far beyond the limits of what we could imagine for automatic translation in terms of both quality and robustness when I started my NLP career in MT training 30 years ago, I have to say that it is a dream come true in every sense.

Dong:
In their terminology, it is "less adequate, but more fluent." Machine translation has gone through three paradigm shifts. When people find that it can only be a good information processing tool and cannot really replace human translation, they will choose the less costly option.

Wei:
In any case, this small test is revealing to me. I am still feeling overwhelmed to see such a miracle live. Of course, what I have just tested is formal style, on a computing and NLP topic, so it certainly hit the system's sweet spot with adequate training corpus coverage. But compared with the pre-NN time when I used both Google SMT and Baidu SMT to help with my translation, this breakthrough is amazing. As a senior old-school practitioner of rule-based systems, I would like to pay deep tribute to our "neural-network" colleagues. They are a group of extremely gifted, crazy geniuses. I would like to quote Jobs' famous words here:

“Here's to the crazy ones. The misfits. The rebels. The troublemakers. The round pegs in the square holes. The ones who see things differently. They're not fond of rules. And they have no respect for the status quo. You can quote them, disagree with them, glorify or vilify them. About the only thing you can't do is ignore them. Because they change things. They push the human race forward. And while some may see them as the crazy ones, we see genius. Because the people who are crazy enough to think they can change the world, are the ones who do.”

@Mao, this counts as my most recent feedback to the Google scientists and their work. Last time, a couple of months ago, when they released their parser and proudly claimed it to be "the most accurate parser in the world", I wrote a blog post to ridicule them after performing a serious, apples-to-apples comparison with our own parser. This time, as they use the same underlying technology to announce this new MT breakthrough with similar pride, I am happily expressing my deep admiration for their wonderful work. This contrast in my attitudes looks a bit weird, but it is actually all based on facts. In the case of parsing, this school suffers from a lack of naturally labeled data to use in perfecting quality, especially when a parser has to be ported to new domains or genres beyond the news corpora. After all, what exists in the sea of language is corpora of raw text with linear strings of words, while the corresponding parse trees are only occasional, artificial objects made by linguists, limited in scope by nature (e.g. the Penn Treebank, or other news-genre parse trees by the Google annotation team). But MT is different: it is a unique NLP area with almost endless, high-quality, naturally occurring "labeled" data in the form of human translation, which has been produced without interruption for ages.

Mao: @wei That is to say, you now embrace or endorse neural MT, a change from your previous views?

Wei:
Yes I do embrace and endorse the practice. But I have not really changed my general view wrt the pros and cons between the two schools in AI and NLP. They are complementary and, in the long run, some way of combining the two will promise a world better than either one alone.

Mao: What is your real point?

Wei:
Despite the biases we are all more or less born with by human nature, conditioned by what we have done and where we come from in terms of technical background, we all need to observe and respect the basic facts. Just listen to the audio of their GNMT translation by clicking the link above: the fluency and even the faithfulness to my original text have in fact outperformed an ordinary human translator, in my best judgment. If I gave this lecture in a classroom and asked an average interpreter without sufficient knowledge of my domain to translate on the spot for me, I bet he would have a hard time performing better than the Google machine cited above (of course, human translation gurus are an exception). This miracle-like fact has to be observed and acknowledged. On the other hand, as I said before, no matter how deep the learning reaches, I still do not see how they can catch up with the quality of my deep parsing in the next few years, when they have no way of magically gaining access to the huge labeled data of trees they depend on, especially across the variety of different domains and genres. They simply cannot "make bricks without straw" (or, as an old Chinese saying goes, even the most capable housewife can hardly cook a good meal without rice). This is because in the natural world there are no syntactic trees and structures for them to learn from; there are only linear sentences. The deep learning breakthrough seen so far is still mainly in supervised learning, which has an almost insatiable appetite for massive labeled data, forming its limiting knowledge bottleneck.

Mao: I'm confused. Which one do you believe is stronger? Who is the world's No. 0?

Wei:
Parsing-wise, I am happy to stay as No. 0 if Google insists on being No. 1 in the world. As for MT, it is hard to say, from what I see, between their breakthrough and some highly sophisticated rule-based MT systems out there. But what I can say is that, at a high level, the trend of mainstream statistical MT winning the space over old-school rule-based MT, in industry as well as in academia, is more evident today than before. This is not to say that rule-based MT is no longer viable or is coming to an end. There are things in which SMT cannot beat rule-based MT. For example, certain types of seemingly stupid mistakes made by GNMT (quite a few laughable examples of totally wrong or opposite translations have been illustrated in this salon in the last few days) are almost never seen in rule-based MT systems.

Dong:
Here is my try of GNMT from Chinese to English:

学习上,初二是一个分水岭,学科数量明显增多,学习方法也有所改变,一些学生能及时调整适应变化,进步很快,由成绩中等上升为优秀。但也有一部分学生存在畏难情绪,将心思用在学习之外,成绩迅速下降,对学习失去兴趣,自暴自弃,从此一蹶不振,这样的同学到了初三往往很难有所突破,中考的失利难以避免。

Learning, the second of a watershed, the number of subjects significantly significantly, learning methods have also changed, some students can adjust to adapt to changes in progress, progress quickly, from the middle to rise to outstanding. But there are some students there is Fear of hard feelings, the mind used in the study, the rapid decline in performance, loss of interest in learning, self-abandonment, since the devastated, so the students often difficult to break through the third day,

Mao: This translation cannot be said to be good at all.

Wei:
Right, that is why it calls for an objective comparison to answer your previous question. Currently, as I see it, the training data for social media and casual text are certainly not enough, hence the translation of online messages is still not their forte.  As for the textual sample Prof. Dong showed us above, Mao said the Google translation is not as good as expected. But even so, I still see impressive progress there. Before the deep learning era, SMT results from Chinese to English were hardly readable; now they can generally be read aloud and roughly understood. There is a lot of progress worth noting here.

Ma:
In fields with big data, DL methods have advanced by leaps and bounds in recent years. I know a number of experts who used to be biased against DL and have changed their views after seeing the results. In the IR field, however, DL is still basically not effective so far, although there are signs of it slowly penetrating IR.

Dong:
The key to NMT is "looking nice". So to people who do not understand the source text, it sounds like a smooth translation. But isn't a translation a "liar" if it loses its faithfulness to the original? This is the Achilles' heel of NMT.

Ma: @Dong, I think all statistical methods have this weak point.

Wei:
Indeed, each approach has its pros and cons. Today I have listened to the Google translation of my blog three times and am still amazed at what they have achieved. There are always some mistakes I can pick out here and there. But to err is human, let alone a machine, right? Besides, the community will not stop advancing and correcting mistakes. From the intelligibility and fluency perspectives, I have been served super satisfactorily today. And this happens between two languages with no historical kinship whatsoever.

Dong:
Some leading managers said to me years ago, "In fact, even if machine translation is only 50 percent correct, it does not matter. The problem is that it cannot tell me which half it cannot translate well. If it could, I could always save half the labor and hire a human translator to translate only the other half." I replied that I was not able to make a system do that. I have been concerned about this issue ever since, up to today, when there is a lot of noise about MT replacing human translation any time now. It is rather like saying that, having McDonald's, you no longer need a fine restaurant for French delicacies. Not to mention that machine translation today still cannot be compared to McDonald's. Computers, with machine translation and the like, are in essence a toy given by God for us humans to play with. God never agreed to equip us with the ability to copy ourselves.

Why did GNMT first choose a language pair like Chinese-to-English to showcase, and not the other way round? This is very shrewd of them. Even if the translation is wrong or misses the point, it is usually fluent in this new model, unlike the traditional models, whose output looks and sounds broken, silly and erroneous. This is characteristic of NMT: it selects the candidate with the greatest similarity in the translation corpus. As a vast number of English readers do not understand Chinese, it is easy to impress them with how great the new MT is, even for a difficult language pair.

Wei:
Correct. A closer look reveals that this "breakthrough" lies more in the fluency of the target language than in the faithfulness to the source language, achieving readability at the cost of accuracy. But this is just the beginning of a major shift. I can fully understand the GNMT people's joy and pride in front of a breakthrough like this. In our career, we do not always have that type of moment for celebration.

Deep parsing is the crown of NLP. It remains to be seen how they can beat us in handling domains and genres lacking labeled data. I wish them good luck, and the day they prove they can make better parsers than mine will be the day of my retirement. To my mind, it does not look like that day is drawing near. I wish I were wrong, so I could travel the world worry-free, knowing that my dream has been better realized by my colleagues.

Thanks to Google Translate at https://translate.google.com/ for helping to translate this Chinese blog into English, which was post-edited by myself. 

 

[Related]

Wei’s Introduction to NLP Architecture Translated by Google

"OVERVIEW OF NATURAL LANGUAGE PROCESSING"

"NLP White Paper: Overview of Our NLP Core Engine"

Introduction to NLP Architecture

It is untrue that Google SyntaxNet is the "world’s most accurate parser"

Announcing SyntaxNet: The World’s Most Accurate Parser Goes Open

Is Google SyntaxNet Really the World’s Most Accurate Parser?

Dr Li's NLP Blog in English

【A Brief Introduction to Natural Language System Architecture】

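For natural language processing (NLP) and its applications, system architecture is the core issue. In my blog post 【立委科普:NLP 联络图 】 I gave four architecture diagrams of NLP systems; here I explain them briefly, one by one.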

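I divide an NLP system, from the core engine up to the applications, into four stages, corresponding to the four diagrams. The bottom and most central layer is deep parsing, an automatic analyzer that works through natural language bottom-up, layer by layer. This work is the most difficult, but it is the foundational technology for the vast majority of NLP systems.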

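The purpose of parsing is to structure unstructured language. Faced with ever-changing linguistic expressions, only once they are structured can patterns be captured easily, information be extracted, and semantics be resolved. This principle has been the consensus of (computational) linguistics ever since Chomsky proposed the transformation from surface structure to deep structure after his linguistic revolution of 1957. A structure tree consists not only of the branches (arcs) that express syntactic relationships, but also of the leaves (nodes), the words or phrases that carry all kinds of information. Important as the tree is, it generally cannot support a product directly; it is only the internal representation of the system, serving as the carrier of language analysis and understanding and as the core support for grounding semantics in applications.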

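The next layer is the extraction layer, as shown in the diagram above. Its input is the structure tree and its output is filled-in templates, much like filling in a form: for the information an application needs, a table is defined in advance, and the extraction system fills in the blanks by pulling the relevant words or phrases out of the sentences and placing them into the predefined fields. This layer moves from the domain-independent parser to tasks that face a domain and target the needs of applications and products.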

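It is worth emphasizing that the extraction layer is domain-oriented and semantically focused, while the preceding analysis layer is domain-independent. A good architecture therefore makes the analysis very deep and very logical in order to lighten the burden of extraction. When extraction works on the logical-semantic structures of deep analysis, one extraction rule is equivalent to hundreds or thousands of rules over the surface of the language. This creates the conditions for porting to new domains.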

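There are two broad types of extraction. One is traditional information extraction (IE), which extracts facts or objective information: entities, relationships between entities, events involving different entities, and so on. It can answer questions such as who did what, when and where. This extraction of objective information is the core technology and foundation of the knowledge graph, which could hardly be hotter nowadays; once IE is done, adding the integration step from the next, mining layer (IF: information fusion) makes it possible to build a knowledge graph. The other type of extraction concerns subjective information, and public opinion mining is based on it. This is also what I have focused on over the past five years: fine-grained opinion extraction (not just positive/negative classification, but also mining the reasons behind opinions to provide a basis for decision-making). It is one of the hardest tasks in NLP, far harder than IE for objective information. The extracted information is usually stored in some kind of database, which supplies fragments of information to the mining layer that follows.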

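Many people confuse extraction (information extraction) with the next step, mining (text mining), but they are in fact tasks at two different levels. Extraction faces individual language trees and looks for the wanted information sentence by sentence. Mining faces a corpus, or a data source as a whole, and digs statistically valuable information out of the forest of language. In the information age, the biggest challenge we face is information overload. We have no way to exhaust the ocean of information, so we must rely on computers to dig key intelligence out of that ocean to serve different applications. Mining therefore relies on statistics by nature: without statistics, the extracted information remains a chaotic heap of fragments with a great deal of redundancy, which mining can integrate.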

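Many systems do not go deep into mining. They simply take a query expressing the information need as the entry point, search in real time the database of extracted information fragments, merge the top n results, and present them to the product and the user. This is in fact also mining, only it realizes a simple form of mining by way of retrieval and supports the application directly.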

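In fact, there is a lot of work that can be done to do mining well. It can not only integrate and improve the quality of the existing information; done in depth, it can also uncover hidden information, that is, information not explicitly expressed in the metadata, such as causal relationships between pieces of information or other statistical trends. This kind of mining was first done in traditional data mining, because traditional mining targeted structured data such as transaction records, where such implicit associations are easy to mine (for example, people who buy diapers often also buy beer, which turns out to be the habitual behavior of new fathers; mining out this kind of information helps optimize product placement and sales). Now that natural language has also been structured into extracted fragments of information in a database, implicit-association mining can of course be applied to it as well, to raise the value of the information.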

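The fourth architecture diagram is the NLP application (apps) layer. At this layer, all the information produced by analysis, extraction and mining can support different NLP products and services: from question answering systems to dynamic browsing of the knowledge graph (this application can already be seen in Google search when searching for a celebrity), from automatic polling to customer intelligence, from intelligent assistants to automatic summarization, and so on.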

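This is my overall account of the basic architecture of NLP, based on nearly 20 years of experience building NLP products in industry. Eighteen years ago, I used an NLP architecture diagram to talk our way into our first round of venture funding; the investor himself told us that this was a million dollar slide. Today's account extends and expands on that diagram.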

As Heaven does not change, neither does the Way.

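I have told the story of this million-dollar slide somewhere before. Before 2000, during the Clinton administration, the United States went through a great leap forward in Internet technology, known to history as the .com bubble. For a time hot money poured in and Internet startups of every kind sprang up like bamboo shoots after rain. Under these circumstances my boss decided to strike while the iron was hot and seek venture capital, and asked me to prepare an introduction to the language system prototype we had implemented. So I drew the three-tier NLP architecture diagram below: the bottom tier is the parser, going from shallow to deep; the middle tier is information extraction built on top of parsing; and the top tier shows several major types of applications, including question answering. Connecting the applications with the two language processing tiers beneath them is a database that stores the results of information extraction, which can supply information to the applications at any time. This architecture has not changed much since I proposed it 15 years ago, although the details and the drawing have been redone no fewer than 100 times. The diagram in this post is roughly one of the first 20 versions; it covers only the core engine (back end) and does not include the applications (front end). My boss sent the diagram to an angel investor on Wall Street early one morning, and by noon we had his reply saying he was very interested. In less than two weeks we received our first angel investment check, for one million US dollars. The investor said the diagram was wonderful, "this is a million dollar slide": it showed both the technical barrier to entry and the great potential of the technology.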


from 科学网—前知识图谱钩沉: 信息抽取引擎的架构

【Related】

Introduction to NLP Architecture

【立委科普:NLP 联络图 】

前知识图谱钩沉: 信息抽取引擎的架构

【立委科普:自然语言parsers是揭示语言奥秘的LIGO式探测仪】 

【征文参赛:美梦成真】

《OVERVIEW OF NATURAL LANGUAGE PROCESSING》 

《NLP White Paper: Overview of Our NLP Core Engine》

White Paper of NLP Engine

【置顶:立委NLP博文】

Introduction to NLP Architecture

(translated by Google Translate, post-edited by myself)

For natural language processing (NLP) and its applications, the system architecture is the core issue.  In my blog (OVERVIEW OF NATURAL LANGUAGE PROCESSING), I sketched four NLP system architecture diagrams, which are presented here one by one.

In my design philosophy, an NLP process is divided into four stages, from the core engine up to the applications, as reflected in the four diagrams.  At the bottom is deep parsing, an automatic sentence analyzer that processes text bottom-up.  This work is the most difficult, but it is the foundation and enabling technology for the vast majority of NLP systems.

[Figure: NLP architecture diagram]

The purpose of parsing is to structure unstructured text.  Facing ever-changing language, only when it is structured in some logical form can we formulate patterns for the information we would like to extract to support applications.  This principle of linguistic structure became the consensus of the linguistics community when Chomsky proposed the transformation from surface structure to deep structure in his linguistic revolution of 1957.  A tree representing the logical form involves not only arcs that express syntactic-semantic relationships, but also nodes of words or phrases that carry various conceptual information.  Despite the importance of such deep trees, they generally do not directly support an NLP product.  They remain the internal representation of the parsing system, the result of language analysis and understanding, which serves as the core support for semantic grounding in applications.
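
A tree made of labeled arcs and information-carrying nodes can be pictured with a minimal Python sketch. The class names, the feature tags and the arc labels `S` and `O` (logical subject and object) are illustrative assumptions, not the representation used by any actual parser.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A word or phrase, plus the conceptual features it carries."""
    text: str
    features: set = field(default_factory=set)


@dataclass
class Arc:
    """A labeled relationship from a head node to a dependent node (by index)."""
    head: int
    label: str
    dependent: int


@dataclass
class ParseTree:
    nodes: list
    arcs: list

    def dependents(self, head, label):
        """Return the nodes attached to `head` by an arc with the given label."""
        return [self.nodes[a.dependent] for a in self.arcs
                if a.head == head and a.label == label]


# Hypothetical, hand-written analysis of "The company released a phone".
tree = ParseTree(
    nodes=[Node("The company", {"ORG"}),
           Node("released", {"LAUNCH"}),
           Node("a phone", {"PRODUCT"})],
    arcs=[Arc(head=1, label="S", dependent=0),   # logical subject
          Arc(head=1, label="O", dependent=2)],  # logical object
)

subject = tree.dependents(1, "S")[0]
print(subject.text, sorted(subject.features))
```

Keeping the arcs separate from the nodes means a later layer can ask for "the logical subject of this predicate" without caring how the sentence was worded on the surface.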

[Figure: NLP architecture diagram]

The next layer after parsing is the extraction layer, as shown in the above diagram.  Its input is the parse tree, and its output is filled-in templates, similar to filling in a form: for the information an application needs, a table is defined in advance (so to speak), and the extraction system fills in the blanks with the related words or phrases extracted from text based on parsing. This layer moves from the domain-independent parser to application-oriented, product-driven tasks.

It is worth emphasizing that the extraction layer is geared towards a domain-oriented semantic focus, while the preceding parsing layer is domain-independent.  Therefore, a good framework does a very thorough analysis of logical semantics in deep parsing, in order to reduce the burden of information extraction.  With the depth of analysis in the logical semantic structures supporting extraction, one rule at the extraction layer is in essence equivalent to thousands of surface rules at the linear text layer.  This creates the conditions for efficient porting to new domains based on the same core parsing engine.
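
The claim that one rule over logical structure stands in for many surface rules can be illustrated with a small Python sketch. It assumes a deep parser has already normalized different wordings into the same predicate-argument form; the logical forms below are written by hand, and the company names, the template fields and the function name are all hypothetical.

```python
def extract_acquisition(logical_form):
    """Fill an Acquisition template from one predicate-argument structure."""
    if logical_form.get("pred") != "acquire":
        return None
    return {"buyer": logical_form.get("S"), "target": logical_form.get("O")}


# Three surface variants. The logical forms are hand-written stand-ins
# for what a deep parser is assumed to output for each wording.
PARSED = {
    "Acme acquired Widgetco.":
        {"pred": "acquire", "S": "Acme", "O": "Widgetco"},
    "Widgetco was acquired by Acme.":
        {"pred": "acquire", "S": "Acme", "O": "Widgetco"},
    "Acme's acquisition of Widgetco":
        {"pred": "acquire", "S": "Acme", "O": "Widgetco"},
}

templates = [extract_acquisition(lf) for lf in PARSED.values()]
for sentence, template in zip(PARSED, templates):
    print(sentence, "->", template)
```

The single rule never looks at word order, voice or nominalization; a surface-level system would need a separate pattern for each of the three wordings.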

There are two types of extraction. One is traditional information extraction (IE), the extraction of facts or objective information: named entities, the relationships between entities, and events involving entities (which can answer questions like "who did what when and where").  This extraction of objective information is the core technology and foundation for the knowledge graph (nowadays such a hot area in industry).  After IE is completed, the information fusion (IF) step in the next layer is aimed at constructing the knowledge graph.   The other type of extraction is about subjective information; public opinion mining, for example, is based on this kind of extraction. My focus over the past five years has been along this line: fine-grained extraction of public opinions (not just sentiment classification, but also exploring the reasons behind the opinions and sentiments to provide an insightful basis for decision-making).  This is one of the hardest tasks in NLP, much more difficult than IE for objective information.  Extracted information is usually stored in a database. This provides a huge number of textual mentions of information to feed the subsequent mining layer.

Many people confuse information extraction and text mining, but, in fact, they are tasks at two different levels.  Extraction faces each individual language tree, embodied in each sentence, in order to find the information we want.  Mining, however, faces a corpus, or the data sources as a whole, gathering statistically significant insights from the language forest.  In the information age, the biggest challenge we face is information overload. We have no way to exhaust the information ocean for the insights we need; therefore, we must use the computer to dig the required critical intelligence out of that ocean to support different applications. Mining therefore naturally relies on statistics: without statistics, the information remains scattered across the corpus even after it is identified.  There is a lot of redundancy in the extracted mentions of information, and mining can integrate them into valuable insights.
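
The difference between the two levels can be sketched in a few lines of Python: extraction yields one record per sentence-level mention, and mining aggregates those redundant records over the whole collection. The records, field names and brand names below are invented for illustration; a real system would read them from the database of extracted mentions.

```python
from collections import Counter

# Hypothetical sentence-level mentions, as an extraction layer might store them.
MENTIONS = [
    {"brand": "BrandX", "sentiment": "negative", "reason": "battery life"},
    {"brand": "BrandX", "sentiment": "negative", "reason": "battery life"},
    {"brand": "BrandX", "sentiment": "negative", "reason": "price"},
    {"brand": "BrandX", "sentiment": "positive", "reason": "camera"},
    {"brand": "BrandY", "sentiment": "negative", "reason": "price"},
]


def top_reasons(mentions, brand, sentiment, n=3):
    """Merge redundant mentions and rank reasons by how often they occur."""
    counts = Counter(m["reason"] for m in mentions
                     if m["brand"] == brand and m["sentiment"] == sentiment)
    return counts.most_common(n)


print(top_reasons(MENTIONS, "BrandX", "negative"))
```

Counting is the simplest possible statistic; it is used here only to show where statistics enter, after extraction has already turned each sentence into a structured record.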

[Figure: NLP architecture diagram]

Many NLP systems do not perform deep mining. Instead, they simply use a query to search in real time the index of extracted information in the database and merge the retrieved information on the fly, presenting the top n results to the user. This is actually also mining, but it achieves a simple form of mining by way of retrieval in order to support an application directly.

There is a lot of work that can be done in the mining layer in order to do mining well. Text mining not only improves the quality of existing extracted information pieces; it can also tap hidden information that is not explicitly expressed in the data sources, such as causal relationships between events, or statistical trends in public opinions or behaviours. This type of mining was first done in traditional data mining applications, as traditional mining was aimed at structured data such as transaction records, making it easy to mine implicit associations (e.g., people who buy diapers often buy beer; this reflects the common behaviour of young fathers of newborns, and such hidden associations can be mined to optimize the layout and sales of goods). Nowadays, natural language is also structured thanks to deep parsing, hence data mining algorithms for hidden intelligence in the database can, in principle, also be applied to enhance the value of intelligence.
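
One common way to quantify an implicit association such as the diapers-and-beer example is lift, the ratio of the observed co-occurrence rate to the rate expected if the two items were independent. The sketch below is a generic illustration of that measure rather than the method of any particular system, and the baskets are toy data invented for the example.

```python
def lift(transactions, a, b):
    """Lift of items a and b: P(a and b) / (P(a) * P(b)).

    A value above 1 means the items co-occur more often than they
    would if they were independent of each other.
    """
    n = len(transactions)
    if n == 0:
        return 0.0
    p_a = sum(a in t for t in transactions) / n
    p_b = sum(b in t for t in transactions) / n
    p_ab = sum(a in t and b in t for t in transactions) / n
    if p_a == 0 or p_b == 0:
        return 0.0
    return p_ab / (p_a * p_b)


# Toy baskets, invented for illustration.
BASKETS = [
    {"diapers", "beer"},
    {"diapers", "beer", "milk"},
    {"milk"},
    {"bread"},
]

print(lift(BASKETS, "diapers", "beer"))
```

The same function applies unchanged once text has been structured: each "basket" could just as well be the set of entities or opinions extracted from one document.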

The fourth architectural diagram is the NLP application layer. In this layer, the results from parsing, extraction, and mining of the unstructured text sources can be used to support a variety of NLP products and services, ranging from QA (question answering) systems to the dynamic construction of the knowledge graph (this type of graph is now visualized in Google search when we search for a star or VIP), from automatic polling of public opinions to customer intelligence about brands, from intelligent assistants (e.g. chatbots, Siri etc.) to automatic summarization and so on.

[Figure: NLP architecture diagram]

This is my overall presentation of the basic architecture of NLP and its applications, based on nearly 20 years of experience designing and developing NLP products in industry.  About 18 years ago, I presented a similar diagram of the NLP architecture to our first venture investor, who told us that it was a million dollar slide.  The presentation here is a natural inheritance and extension of that diagram.

~~~~~~~~~~~~~~~~~~~
Here is the previously mentioned million-dollar slide story.  Under the Clinton administration before the turn of the century, the United States went through a "great leap forward" in Internet technology, known as the Dot Com Bubble, a time of hot money pouring into the IT industry while all kinds of Internet startups were springing up.  In such a situation, my boss decided to seek venture capital for business expansion, and asked me to prepare an illustration introducing our prototype of the implemented natural language system.  I then drew the following three-tier diagram of an NLP system: the bottom layer is parsing, from shallow to deep; the middle layer is information extraction, built on parsing; and the top layer illustrates some major categories of NLP applications, including QA.  Connecting the applications and the two language processing layers below them is the database, used to store the results of information extraction, ready to be applied at any time to support the applications above.  This general architecture has not changed much since I made it years ago, although the details and layout have been redrawn no less than 100 times.  The architecture diagram below is about one of the first 20 editions, involving mainly the backend core engine of the information extraction architecture, not so much the front-end flowchart for the interface between applications and the database.  I still remember that early in the morning my boss sent the slide to a Wall Street angel investor, and by noon we got his reply, saying that he was very interested.  In less than two weeks, we got the first million dollar angel investment check.  The investor labeled it a million dollar slide, which he believed showed not only the depth of the language technology but also its great potential for practical applications.

[Figure: the million dollar slide]

Pre-Knowledge Graph: Architecture of Information Extraction Engine

 

【Related Chinese Blogs】

NLP Overview

Pre-Knowledge Graph: The Architecture of Information Extraction Engine

Natural language parser is to reveal the mystery of the language like a LIGO-type detector

Dream come true

( translated from http://blog.sciencenet.cn/blog-362400-981742.html )

The speech generation of the fully automatically translated, un-edited science blog of mine is attached below (for your entertainment :=), it is amazingly clear and understandable (definitely clearer than if I were giving this lecture myself with my strong accent).  If you are an NLP student, you can listen to it as a lecture note from a seasoned NLP practitioner.

Thanks to the newest Google Translate service from Chinese into English at https://translate.google.com/ 

 

 

[Related]

Wei’s Introduction to NLP Architecture Translated by Google

"OVERVIEW OF NATURAL LANGUAGE PROCESSING"

"NLP White Paper: Overview of Our NLP Core Engine"

【Liwei Popular Science: Google NMT, the Moment to Witness a Miracle】

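News of a new advance in artificial intelligence has been spreading like wildfire on WeChat recently: Google Translate has achieved a major breakthrough! It deserves attention and congratulations. The almost unlimited naturally labeled data available to MT seems to be starting to show its power under the new technology. The report says: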

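Ten years ago, we launched Google Translate. The core algorithm behind this service was phrase-based machine translation (PBMT: Phrase-Based Machine Translation).

Since then, rapid advances in machine intelligence have brought huge improvements to our speech recognition and image recognition capabilities, but improving machine translation has remained a very difficult goal.

Today we announce the release of the Google Neural Machine Translation (GNMT) system, which uses state-of-the-art training techniques to achieve the largest improvement in machine translation quality to date. For full details of our research results, please see our paper "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation".

A few years ago, we began using recurrent neural networks (RNN: Recurrent Neural Networks) to directly learn the mapping from an input sequence (such as a sentence in one language) to an output sequence (the same sentence in another language). Whereas phrase-based machine translation (PBMT) breaks the input sentence into words and phrases and then translates them largely independently, neural machine translation (NMT) treats the entire input sentence as the basic unit of translation.

The advantage of this approach is that it requires less engineering design than the earlier phrase-based translation systems. When it was first proposed, NMT already reached accuracy comparable to phrase-based translation systems on medium-sized public benchmark data sets.

Since then, researchers have proposed many techniques to improve NMT, including mimicking an external alignment model to handle rare words, using attention to align input words with output words, and breaking words into smaller units to cope with rare words. Despite these advances, the speed and accuracy of NMT had not yet met the requirements of a production system such as Google Translate.

Our new paper describes how we overcame the many challenges of making NMT work on very large data sets, and how we built a system that is sufficiently fast and accurate to provide a better translation experience for Google's users and services.

Data from side-by-side evaluations, in which human raters compare and score the quality of translations of a given source sentence. Scores range from 0 to 6, where 0 means "completely nonsensical translation" and 6 means "perfect translation".

............

Using side-by-side comparison by human raters as the metric, the translations produced by the GNMT system are vastly improved over the previous phrase-based production system.

With the help of bilingual human raters, we measured on sample sentences from Wikipedia and news websites that GNMT reduces translation errors by 55%-85% or more on several major language pairs.

In addition to releasing this research paper today, we are announcing that GNMT has been put into production for a very difficult language pair: Chinese to English.

The mobile and web versions of Google Translate now use GNMT for 100% of Chinese-to-English machine translations, about 18 million translations per day. The production deployment of GNMT uses our publicly available machine learning toolkit TensorFlow and our Tensor Processing Units (TPUs), which provide sufficient computational power to deploy these powerful GNMT models while meeting the strict latency requirements of the Google Translate product.

Chinese to English is one of the more than 10,000 language pairs supported by Google Translate, and over the coming months we will continue to extend GNMT to many more language pairs.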

from 谷歌翻译实现重大突破

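As an old MT hand, I cannot help being drawn in. I am about to give this latest Google neural translation a small try.
I tried Google's online translation before, and overall it was not as good as Baidu's, but now they say Chinese MT has gone quite "neural": deep neural, approaching human. I happen to have several hundred posts waiting to be translated, just right for a trial, so let me try it first. I look forward to Google's miraculous translation.

Dong:
@wei I hope it does not disappoint you. I once said half-jokingly: rule-based MT is a fool, statistical MT is a madman, and now I continue the joke: neural MT is a "liar" (I am by no means referring to its developers). Language is not like cat faces or mugs; looking alike on the surface is not enough, the content has to be alike too!

Wei: Now is the moment to witness a miracle: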

The automatic speech generation of this science blog of mine is attached here, it is amazingly clear and understandable. If you are an NLP student, you can listen to it as a lecture note from a seasoned NLPer (definitely clearer than if I were giving this lecture myself with my strong accent).   More amazingly, the original blog was in Chinese and I used the newest Google Translate claimed to be based on deep learning using sentence-based translation as well as character-based techniques.  My original blog in Chinese is here, you can compare:【立委科普:自然语言系统架构简说】。

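Prof. Dong, you know my background and my skepticism. But in the face of such progress, of automatic translation quality and robustness far beyond the limits we could imagine when we entered this field, we have to, have to, have to admire it.

Dong:
In their terminology it is "less adequate, but more fluent". Machine translation has gone through three paradigm shifts. Once people find that, whatever happens, it can only be a very good information processing tool and cannot replace human translation, they will choose whichever costs less.

Wei:
In any case, this little test left an old MT hand like me somewhat dumbfounded, and I have not yet recovered from the shock. Of course, it so happens that what I tested was formal writing on the topics of computers and NLP, which is certainly within the coverage of their corpus, so I hit the sweet spot. Still, compared with the pre-neural Google SMT and Baidu SMT that I used before, this leap of a breakthrough is astonishing. Hats off to our neural colleagues. They are a bunch of extremely smart madmen.

Mao, this is my feedback on Google's recent claim. Last time I ridiculed their parser; this time I express my deep admiration for the MT breakthrough they achieved with the same technology. This contrast does not mean that I have gone crazy or become split-minded. In parsing, they suffer from a lack of naturally labeled data, and even the cleverest housewife cannot cook a meal without rice, so they cannot compete with the symbolic-logic school. MT is different: there is an almost inexhaustible supply of naturally labeled data (human translation has never stopped, and the parallel translations it has left behind are as vast as the sea).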

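Mao: @wei So you are saying that you are convinced by neuron-based MT and have changed your views and position?

Wei: I am convinced, but I have not really changed.

Mao: How so?

Wei:
Whatever our factional biases, the basic facts should still be seen clearly. Listen to their SMT translation listed above: its fluency and its faithfulness to my original text already exceed what an ordinary person would achieve as a translator. If an interpreter does not know my field and I lecture from this script and ask such an average interpreter to translate on the spot, he cannot beat the machine, in either faithfulness or fluency. (Master translators aside.) This I have to concede. On the other hand, as I said before, however deep the neural networks go, I do not see them catching up with my deep parser within the next few years, and that has not changed. In particular, they cannot match its ability to handle different domains and genres, because in the natural world there are no labeled syntactic trees, only linear sentences. The breakthroughs seen so far are all in supervised deep learning, which is helpless without massive labeled data.

Mao: You have confused me. Which school are you saying is stronger? @wei Who exactly is the world's No. 0?

Wei: In parsing I am No. 0, and Google cannot catch up. In MT Google has made a major breakthrough, and I expect hard times for MT of the symbolic-logic school.

Mao: What I am asking is who is No. 0 in MT, regardless of method.

Wei: This is not to say that rule-based MT systems have no way to survive, but overall the trend of SMT (statistical MT) gaining the upper hand is still strengthening.

Yun: Thanks. Let me try and see whether it can translate the company white paper I wrote.

Wei:
If you add a little human post-editing, I expect the result will be very good. Do not naively hire someone to translate from scratch any more. Translation companies that do not use MT as a base will be eliminated; in terms of cost they can hardly survive.

Dong: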
学习上,初二是一个分水岭,学科数量明显增多,学习方法也有所改变,一些学生能及时调整适应变化,进步很快,由成绩中等上升为优秀。但也有一部分学生存在畏难情绪,将心思用在学习之外,成绩迅速下降,对学习失去兴趣,自暴自弃,从此一蹶不振,这样的同学到了初三往往很难有所突破,中考的失利难以避免。
Learning, the second is a watershed, the number of subjects increased significantly, learning methods have also changed, some students can adjust to adapt to changes in progress, progress quickly, from the middle to rise to outstanding. But there are some students there is fear of hard feelings, the mind used in the study, the rapid decline in performance, loss of interest in learning, self-abandonment, since the devastated, so the students often difficult to break through the third day,

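Mao: This translation is not good at all, is it?

Wei:
That is exactly what I was waiting to hear 🙂 @Mao, a comparison is needed to answer your question.

Mao: Then bring yours out for comparison?

Wei: I stopped doing MT long ago; I am a deserter. Nearly 20 years ago I moved on to information extraction (IE) and sentiment mining. In that area I have confidence and am not afraid of comparison.

Liu: Forwarding: 谷歌新版翻译有多神?英文教授亲证后告诉你...

Wei: Thanks. The review seems fairly pertinent. For spoken language it is certainly still not good enough; their training set has never covered spoken language well. When I tested it before, it got even some common, simple colloquial expressions wrong. I do not know how much this has been strengthened this time.

As for the passage of Google translation Prof. Dong gave above, Mao says the translation is not much good. But having worked on MT for many years, I know that reaching this level actually represents great progress. Chinese-to-English output used to be unreadable; now it can by and large be understood when read aloud. There is much progress here worth noting.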

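Liu: @wei Forwarding one more: Do some of the things done with big data really fall within the scope of artificial intelligence practice (the word "research" can no longer be used)? Wasn't that originally the work of people from traditional computer science departments? Constantly sneering that each linguist fired means a few more steps forward is too narrow-minded.

Ma: In fields with ample data, DL methods have advanced by leaps and bounds in recent years, and several people I know who used to be biased against DL have more or less changed their views. In the IR field DL has so far shown basically no effect, but it is slowly seeping in.

Mao: I do not agree with the phrase "traditional computer science department". Computer science departments should follow practice, not the other way round.

Dong:
The key to NMT is "resemblance". As a result, people who do not understand the source text sometimes think the translation is very smooth. But isn't a translation without faithfulness a liar? How can it know that its own translation is completely off? This is also the Achilles' heel of NMT.

Ma: Prof. Dong, I think all statistical methods have this Achilles' heel.

Wei:
An inch has its strengths and a foot its shortcomings, so this is no surprise. Today I have listened to this translation of my blog three times, sighing in admiration at every step. Damn, how can it be so smooth? If you look for faults and translation errors, there are always some. But humans mistranslate too. In terms of intelligibility and fluency, I am convinced anyway. And this happens between two languages with no kinship.

Dong:
Years ago some leading officials said to me, "Actually it does not matter if machine translation is only 50 percent correct. The question is whether you can tell me which half it is, so that I can find someone to translate just that part." I answered that I could not. I have been following this issue ever since, right up to today, when many people are clamoring that human translation is about to be replaced. This is a bit like saying that with McDonald's we no longer need French haute cuisine. Besides, machine translation has not even reached the level of McDonald's. Computers, and machine translation too, are something God gave mankind to play with; God did not give mankind the ability to copy itself.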

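Hong:

My view is very simple:
a shadow cannot turn three-dimensional.
Only a person pressed flat into two dimensions
would sigh at being no match for a shadow.

Artificial intelligence is like a shadow,
piling up data as people go about their activities.
Building deep learning models
is as much fun as a shadow puppet play.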

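Dong:
Yes. I once checked more than ten famous English works against their translations and found one translation in which the translator had obviously and deliberately skipped passage after passage. Those passages were full of flowers, plants and the like; I suppose the master could not be bothered to look them up, and so left them untranslated.

Why is the first language pair GNMT chose Chinese-to-English rather than English-to-Chinese? This is very shrewd. Even when a human translation is wrong or has omissions, the text is usually smooth, or at least never foolish, crazy and tongue-twisting like traditional machine translation, and that is exactly the hallmark of NMT: it picks the candidate of greatest similarity among translations. That way the broad English readership, most of whom do not understand Chinese, is easily "bluffed" by it.

Wei:
Right. Looked at closely, this "breakthrough" has fluency to spare but falls short on faithfulness; it has overcorrected.
But everything is only just beginning. I can understand the joy of the NMT people in the face of a breakthrough.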

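Hong:
Master Wei has long played with NLP,
always proud and aloof, never bowing his head.
Today he submits and marvels at a miracle:
he has converted to deep neural networks!

Wei:
Conversion is going too far, and I am not qualified for it either. My admiration is heartfelt, and I hope there will be opportunities to cooperate in the future, to learn from each other's strengths and reach a win-win. If they do not think much of us, we will go it alone. Deep parsing is the crown of NLP. The day neural parsing surpasses mine across the board, I will retire. For now I still feel that, by this standard, I will probably never retire in this lifetime. I hope I am wrong, so that I can set off early to travel the world.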

 

【Related】

Wei’s Introduction to NLP Architecture

谷歌翻译实现重大突破

谷歌新版翻译有多神?英文教授亲证后告诉你...

【立委科普:NLP 联络图】 (sister post)

机器翻译

Wei's Introduction to NLP Architecture Translated by Google

Introduction to NLP Architecture
by Dr. Wei Li
(fully automatically translated by Google Translate)

The automatic speech generation of this science blog of mine is attached here, it is amazingly clear and understandable, if you are an NLP student, you can listen to it as a lecture note from a seasoned NLPer (definitely clearer than if I were giving this lecture myself with my strong accent):

To preserve the original translation, nothing is edited below.  I will write another blog to post-edit it, making it an "official" introduction to NLP architecture for the audience, perused and endorsed by myself, the original writer.  But for the time being, it is completely unedited, thanks to the newly launched Google Translate service from Chinese into English at https://translate.google.com/ 

[Legislature science: natural language system architecture brief]

For the natural language processing (NLP) and its application, the system architecture is the core issue, I blog [the legislature of science: NLP contact diagram] which gave four NLP system architecture diagram, now one by one to be a brief .
I put the NLP system from the core engine to the application, is divided into four stages, corresponding to the four frame diagram. At the bottom of the core is deep parsing, is the natural language of the bottom-up layer of automatic analyzer, this work is the most difficult, but it is the vast majority of NLP system based technology.

160213sg5p2r8ro18v17z8

The purpose of parsing is to structure unstructured languages. The face of the ever-changing language, only structured, and patterns can be easily seized, the information we go to extract semantics to solve. This principle began to be the consensus of (linguistics) when Chomsky proposed the transition from superficial structure to deep structure after the linguistic revolution of 1957. A tree is not only the arcs that express syntactic relationships, but also the nodes of words or phrases that carry various information. Although the importance of the tree, but generally can not directly support the product, it is only the internal expression of the system, as a language analysis and understanding of the carrier and semantic landing for the application of the core support.

160216n8x8jj08qj2y1a8y

The next layer is the extraction layer (extraction), as shown above. Its input is the tree, the output is filled in the content of the templates, similar to fill in the form: is the information needed for the application, pre-defined a table out, so that the extraction system to fill in the blank, the statement related words or phrases caught out Sent to the table in the pre-defined columns (fields) to go. This layer has gone from the original domain-independent parser into the face-to-face, application-oriented and product-demanding tasks.
It is worth emphasizing that the extraction layer is domain-oriented semantic focus, while the previous analysis layer is domain-independent. Therefore, a good framework is to do a very thorough analysis of logic, in order to reduce the burden of extraction. In the depth analysis of the logical semantic structure to do the extraction, a rule is equivalent to the extraction of thousands of surface rules of language. This creates the conditions for the transfer of the domain.
There are two types of extraction, one is the traditional information extraction (IE), the extraction of fact or objective information: the relationship between entities, entities involved in different entities, such as events, can answer who dis what when and where When and where to do what) and the like. This extraction of objective information is the core technology and foundation of the knowledge graph which can not be renewed nowadays. After completion of IE, the next layer of information fusion (IF) can be used to construct the knowledge map. Another type of extraction is about subjective information, public opinion mining is based on this kind of extraction. What I have done over the past five years is this piece of fine line of public opinion to extract (not just praise classification, but also to explore the reasons behind the public opinion to provide the basis for decision-making). This is one of the hardest tasks in NLP, much more difficult than IE in objective information. Extracted information is usually stored in a database. This provides fragmentation information for the underlying excavation layer.
Many people confuse information extraction and text mining, but in fact this is two levels of the task. Extraction is the face of a language tree, from a sentence inside to find the information you want. The mining face is a corpus, or data source as a whole, from the language of the forest inside the excavation of statistical value information. In the information age, the biggest challenge we face is information overload, we have no way to exhaust the information ocean, therefore, must use the computer to dig out the information from the ocean of critical intelligence to meet different applications. Therefore, mining rely on natural statistics, there is no statistics, the information is still out of the chaos of the debris, there is a lot of redundancy, mining can integrate them.

160215hzp5hq5pfd1alldj

Many systems do not dig deep, but simply to express the information needs of the query as an entrance, real-time (real time) to extract the relevant information from the fragmentation of the database, the top n results simply combined, and then provide products and user. This is actually a mining, but is a way to achieve a simple search mining directly support the application.
In fact, in order to do a good job of mining, there are a lot of work to do, not only can improve the quality of existing information. Moreover, in-depth, you can also tap the hidden information, that is not explicitly expressed in the metadata information, such as the causal relationship between information found, or other statistical trends. This type of mining was first done in traditional data mining because the traditional mining was aimed at structural data such as transaction records, making it easy to mine implicit associations (eg, people who buy diapers often buy beer , The original is the father of the new people's usual behavior, such information can be excavated to optimize the display and sale of goods). Nowadays, natural language is also structured to extract fragments of intelligence in the database, of course, can also do implicit association intelligence mining to enhance the value of intelligence.
The fourth architectural diagram is the NLP application layer. In this layer, analysis, extraction, mining out of the various information can support different NLP products and services. From the Q & A system to the dynamic mapping of the knowledge map (Google search search star has been able to see this application), from automatic polling to customer intelligence, from intelligent assistants to automatic digest and so on.

16221285l5wkx8t5ffi8a9

This is my overall understanding of the basic architecture of NLP. Based on nearly 20 years in the industry to do NLP product experience. 18 years ago, I was using a NLP structure diagram to the first venture to flicker, investors themselves told us that this is million dollar slide. Today's explanation is to extend from that map to expand from.
Days unchanged Road is also unchanged.

Where previously mentioned the million-dollar slide story. Clinton said that during the reign of 2000, the United States to a great leap forward in Internet technology, known as. Com bubble, a time of hot money rolling, all kinds of Internet startups are sprang up. In such a situation, the boss decided to hot to find venture capital, told me to achieve our prototype of the language system to do an introduction. I then draw the following three-tier structure of a NLP system diagram, the bottom is the parser, from shallow to deep, the middle is built on parsing based on information extraction, the top of the main categories are several types of applications, including Q & A system. Connection applications and the following two language processing is the database, used to store the results of information extraction, these results can be applied at any time to provide information. This architecture has not changed much since I made it 15 years ago, although the details and icons have been rewritten no less than 100 times. The architecture diagram in this article is about one of the first 20 editions. Off the core engine (background), does not include the application (front). Saying that early in the morning by my boss sent to Wall Street angel investors, by noon to get his reply, said he was very interested. Less than two weeks, we got the first $ 1 million angel investment check. Investors say that this is a million dollar slide, which not only shows the threshold of technology, but also shows the great potential of the technology.

Pre - Knowledge Mapping: The Structure of Information Extraction Engine

【Related】
[Legislature science: NLP contact map (one)]
Pre - Knowledge Mapping: The Architecture of Information Extraction Engine
[Legislature science: natural language parsers is to reveal the mystery of the language LIGO-type detector]
【Essay contest: a dream come true
"OVERVIEW OF NATURAL LANGUAGE PROCESSING"

"NLP White Paper: Overview of Our NLP Core Engine"

White Paper of NLP Engine

"Zhaohua afternoon pick up" directory

[Top: Legislative Science Network blog NLP blog at a glance (regularly updated version)]

retrieved 10/1/2016 from https://translate.google.com/

translated from http://blog.sciencenet.cn/blog-362400-981742.html

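【Li-Bai Dialogues No. 6: NLP Components and Their Relations】

Bai:
In "交杯酒" (the nuptial toast drunk with linked arms), "交杯" does not seem to modify "酒" (wine). "散伙饭" (a farewell dinner) fares a bit better than "交杯酒", perhaps because "饭" on its own refers to a dinner gathering more often than "酒" on its own refers to a toast.

Me:
Isn't this just a black box? What use is its internal relation to semantic computation? If it is useful, hard-code ("kidnap") it in the lexicon; if not, ignore it. The difference between "交杯酒" and "酒" is that the former has a slot 【with+human】: "与张三的交杯酒刚喝过,李四就跟他掰了。" (Barely had the nuptial toast with Zhang San been drunk when Li Si fell out with him.) The latter seems possible too, but only rather randomly, or when the latter refers to the former: "与张三的酒刚喝过。。。" (Barely had the wine with Zhang San been drunk...)

Bai:
Considering the safety and acceptability of coining new expressions, this problem cannot be settled by lexical hard-coding alone. "见面礼" (a gift given at a first meeting) belongs to the same class. The now-popular "谢师宴" (a banquet to thank one's teachers) was surely not said some years ago. How to "generalize safely" is a new topic for language generation.

Me:
If we are talking about language generation, say in a machine translation application, then a system has room to choose. It need not translate into a short, compact 【compound word】. It can use a looser syntactic expression, which is relatively safe and sidesteps the problem of generalizing word formation, because syntax is by nature general and open-ended, whereas word formation is not. "谢师宴" can be expressed as "感谢恩师的宴会" (a banquet to thank one's respected teacher).

Bai:
Human-machine dialogue is different.
It needs surprises.

Me:
Teacher Bai is looking at the future, at icing on the cake. For now, even the bare necessities are far from solved.
In parsing, a lexicon indeed can never list all such productive compounds. Chinese has exceptionally strong word-formation capacity, so a dedicated compounding module is needed to recognize them.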

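Bai:
Reduplication of a monosyllabic adjective plus "的" should be a systematic phenomenon. Putting it in the lexicon hardly looks like the right path.

Me:
It should be both. Common AA reduplications, especially the two-character compounds, are listed in ordinary dictionaries. A systematic rule is also necessary as a backstop to guarantee recall. Besides, "美" (beautiful) and "美美" are not in a 1+1 relation. Predicates that "美美" can modify adverbially, whether sleeping or eating, cannot take a bare "美" at all. Likewise, "好好" differs greatly from "好" (good). "幸幸福福" versus "幸福" (happy), however, is an entirely regular, systematic phenomenon; even where usage differs, it differs systematically. This is unlike "好好" and "美美".

Bai:
"美美" means the experiencer feels good. "好好" means the one who makes the request or wish finds the requirement met. "轻轻" means the agent's body, or the object the agent manipulates, seems light. All of this has basically nothing to do with the predicate.

Me:
"美美睡上一觉" (have a lovely sleep); "睡一个美美的觉。" (sleep a lovely sleep.)
The predicate would hardly agree that it is irrelevant.
If that is irrelevant, then "辛勤" (diligent) is irrelevant to "工作" (work) too. "辛勤" is about a person and "工作" is about a person; when "辛勤" modifies "工作" as an adverbial, the two persons are one and the same.
If syntactic modification should leave no trace at the logical-semantic level, then logical-semantic representation has no path for relative clauses. How then do we handle the semantic difference between "我说的话" (the words I said) and "我说话" (I speak)?
Our current treatment: "我说的话" contains a clause "我说话", and this clause has a modification path (Mod-S) pointing to "话".

[Image: 0928b]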

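Bai:
Logical semantics is a "structure". By default, the word that carries the outermost structure faces outward. When another word needs to face outward, a relative clause is needed to change that. So a relative clause does not change the logical-semantic relations inside the structure; it merely designates a different word to "represent" the structure to the outside. S-mod is a syntactic relation, not a logical-semantic one.

Me:
My understanding of logical semantics is a broad one: it covers all semantics, expressed by sentences, that concerns relations between concepts, and it represents a person's understanding of the sentence. With logical semantics, plus the concepts at the nodes (the mapping from words to concepts, which in theory goes through WSD), one can say a person understands the language. If a machine achieves these two things, that is machine natural language understanding. From this angle, a relative clause is not only a surface syntactic relation but also a deep semantic relation (in a dimension other than argument structure).

Bai:
In "吃饭" (eat a meal) and "吃的饭" (the meal eaten), the logical-semantic relation between "吃" and "饭" is unchanged; only the structure's outward spokesperson differs, falling on "吃" in one and on "饭" in the other.

Me:
Right. In the arg-structure dimension, "吃饭" and "吃的饭" have exactly the same logical semantics, which is why the lower clause is identical in our deep parse trees. But in dimensions beyond this SVO structure, that is, when this SVO relates to other SVOs, such relations are also required for language understanding and are also semantics. Whether this semantics and its formal representation should be called logical semantics is a matter of naming. It is indeed necessary for understanding, it is indeed semantics, and one can hardly call it illogical. In "我喜欢吃饭" (I like eating), the arg structure of "吃饭" directly serves as the object of "喜欢" (like); in "我喜欢我吃的饭" (I like the meal I eat), this arg structure has to step down a level and serve as the object of "喜欢" through "饭". Logically, an arg structure is just the most basic building block of event semantics.

Bai:
A structure has several constituents that can serve as the carrot (the slot filler), including the outermost predicate itself. This does not go beyond logical semantics. What truly goes beyond it is pragmatics: a relative clause, for instance, gives the feeling of "creating a fait accompli" and thereby "imposing it on people".

Me:
The semantics arising from stacking these building blocks can be expressed in language in many ways, and the economy (or laziness) principle of language often leaves out the carrots in these blocks' slots. This makes mapping language to logic difficult and constitutes the challenge of deep parsing. To say that the relative clause is a syntactic form expressing pragmatics rather than semantics is a defensible position. But the boundary between semantics and pragmatics has a considerable gray area anyway. Which things can be pulled from the pragmatic side into semantics, and which can be suspended in semantics and left for pragmatics to resolve, is a practice on which each side has its own reasons; in practice it is a matter of system-internal coordination.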

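Bai:
"惯于充当世界警察的美国" (the United States, which is used to acting as the world's policeman) carries the sense of imposing "the United States is used to acting as the world's policeman" on the audience as an established fact.

Me:
My personal principle: whatever is domain independent should be represented and resolved in semantics; whatever involves a domain or an application should be suspended and left for 语用 to resolve. This ties 语用 (pragmatics) closely to applications (apps). Relative clauses are domain independent: whatever the domain or application, the meaning a relative clause expresses is the same. Of course, some phenomena in semantic computation seem partly but not fully domain independent, and the decision then is somewhat arbitrary. Resolving them at the semantic stage adds to the burden of the semantic component and saves work for the domains that need them; the waste is that for domains with no such need, that semantic work is done for nothing.

Bai:
What you call 语用 is not pragmatics but language usage. Pragmatics is necessarily domain independent. But pragmatics is likewise independent of logical-semantic structure.

Me:
I am not sure about the former. Perhaps the community understands pragmatics the way you describe; in that case, "my definition of semantics" includes that part. The latter seems wrong: language usage generally refers to purely linguistic surface phenomena such as syntax, morphology, and idiomatic usage. Language usage is not a relatively independent, complete component of linguistics.

Bai:
Or application; in any case it does not mean pragmatics. This misunderstanding has been around for a while; whenever I heard you say 语用, it never quite matched up.

Me:
Haha.
When people use different terminologies and do not know each other's, communication is indeed awkward.
Take a concrete case. When Fillmore proposed Case Grammar (deep cases) in the 1970s (?), my understanding was that this was semantics, in fact logical semantics. As he pushed this line further, it drew closer and closer to 语用. The eventual FrameNet is, in my system of understanding, the result of a transition from "semantics" to "语用" (which is why I have always criticized it as awkwardly placed in NLP), but it is still basically domain independent and can be put under the broad category of semantics. However, when MUC established IE, things were no longer domain independent, and so it became 语用 through and through. The Templates for events and relations defined in information extraction (the origin of knowledge graphs) are formally of a piece with Fillmore's FrameNet. But in FrameNet, thousands of Frames are organized into a hierarchy that is basically independent of domain, while IE abandons this top-down inheritance altogether: everything is piecemeal, with Templates cobbled together per domain and per application, directly serving products.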

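Bai:
Anaphora resolution, inference of implicature and so on are the problems that the "pragmatics" stage is supposed to solve.
In the US, as long as you do not use "pragmatics" for what you call 语用, there is no problem; with people in China it is different. What you call 语用 is called knowledge representation in China.

Me:
In my "popular science" scheme, anaphora is another component, belonging to Discourse, which is another dimension, the dimension of text. Knowledge representation comes in two broad kinds. One is ontology, either general, such as Teacher Dong's HowNet, or domain-specific, such as a medical ontology. The other kind is dynamic and fluid: the knowledge graph that is all the rage now. Its foundation is IE, plus fusion supported by discourse-level and cross-document work, including mining such as merging and deconflicting.

Bai:
There is anaphora within a sentence too, with no need for discourse.

Me:
There is anaphora within a sentence, which is why syntax, whose largest unit is the sentence, interacts with it. The result of that interaction is Chomsky's so-called Binding Theory or Principles. But once anaphora has been handled within the sentence with the help of syntax, the natural next step is discourse. In fact, one of Chomsky's binding principles pushes what syntax cannot settle over to discourse. That principle says that a certain NP in the sentence cannot be the referent of the anaphoric word ("自己" (self), "他" (he)). Under this principle, syntax merely rules out one possibility, leaving the others for discourse to find.

My related popular-science piece is 【立委科普:NLP 联络图】 (English version: OVERVIEW OF NATURAL LANGUAGE PROCESSING). In it I sort out the linguistic components related to NLP according to my own understanding.

 

【Related】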

Not an ad. But a historical record.

Although not updated for long, this wiki remains like this until today 9/28/2016
from https://en.wikipedia.org/wiki/NetBase_Solutions,_Inc.

NetBase Solutions, Inc.

NetBase Solutions, Inc.
Type: Private
Industry: Market Research
Founded: 2004
Founders: Jonathan Spier and Michael Osofsky
Headquarters: Mountain View, CA, USA
Area served: Worldwide
Key people: Peter Caswell, CEO; Mark Bowles, CTO; Lisa Joy Rosner, CMO; Dr. Wei Li, Chief Scientist
Products: NetBase Insight Workbench
Website: www.netbase.com

NetBase Solutions, Inc. is a Mountain View, CA based developer of natural language processing technology used to analyze social media and other web content. It was founded by two engineers from Ariba in 2004 as Accelovation, before changing names to NetBase in 2008. It has raised a total of $21 million in funding. It's sold primarily on a subscription basis to large companies to conduct market research and social media marketing analytics. NetBase has been used to evaluate the top reasons men wear stubble, the products Kraft should develop and the favorite tech company based on digital conversations.

History

NetBase was founded by Jonathan Spier and Michael Osofsky, both of whom were engineers at Ariba, in 2004 as Accelovation, based on the combination of the words “acceleration” and “innovation.”[1][2] It raised $3 million in funding in 2005, followed by another $4 million in 2007.[1][3] The company changed its name to NetBase in February 2008.[4][5]

It developed its analytics tools in March 2010 and shortly afterwards began publishing monthly brand passion indexes (BPI), which use the tool to compare brands in a market segment.[6] In 2010 it raised $9 million in additional funding and another $2.5 million in debt financing.[1][3] NetBase Insight Workbench was released in March 2011, and a partnership was formed with SAP AG that December for SAP to resell NetBase's software.[7] In April 2011, a new CEO, Peter Caswell, was appointed.[8] Former TIBCO co-inventor, patent author and CTO Mark Bowles is now the CTO at NetBase and is responsible for many technical achievements in scalability.[9]

Software and services

Screenshot of NetBase Insight Workbench dashboard

NetBase sells a tool called NetBase Insight Workbench that gives market researchers and social marketers a set of analytics, charts and research tools on a subscription basis. ConsumerBase is what the company calls the back-end that collects and analyzes the data. NetBase targets market research firms and social media marketing departments, primarily at large enterprises with a price-point of around $100,000.[10][11] NetBase is also white-labeled by Reed Elsevier in a product called illumin8.[12]

Uses

For the average NetBase user, 12 months of activity is twenty billion sound bites from just over seven billion digital documents. The company claims to index 50,000 sentences a minute from sources like public-facing Facebook, blogs, forums, Twitter and consumer review sites.[13][14]

According to a story in InformationWeek, Kraft uses NetBase to measure customer needs and conduct market research for new product ideas.[15] In 2011 the company released a report based on 18 billion postings over twelve months on the most loved tech companies. Salesforce.com, Cisco Systems and Netflix were among the top three.[16] Also in 2011, NetBase found that the news of Osama Bin Laden eclipsed the royal wedding and the Japan earthquake in online activity.[17]

References

  1. By Matt Marshall, VentureBeat. “Accelovation Raises $4M for online software for IT market research.” December 3, 2007.
  2. BusinessWeek profile.
  3. By Jon Xavier, BizJournals. “NetBase filters social media for what clients need to know.” June 3, 2011.
  4. By Barbara Quint, Information Today. “Elsevier and NetBase Launch illumin8.” February 28, 2008.
  5. The Economist. “Improving Innovation.” February 29, 2008.
  6. By Rachael King, BusinessWeek. “Most Loved -- And Hated -- Tech Companies.”
  7. By Barb Darrow, GigaOm. “SAP taps NetBase for deep social media analytics.” December 12, 2011. Retrieved May 8, 2012.
  8. San Jose Mercury News. “People on the Move.” May 15, 2011.
  9. By David F. Carr, InformationWeek. “How Much is your Brand Loved (or Hated)?” June 16, 2011.
  10. By Eric Schoenfeld, TechCrunch. “NetBase Offers Powerful Semantic Indexing Platform That Reads The Web.” April 22, 2009.
  11. By Jon Xavier, BizJournals. “NetBase filters social media for what clients need to know.” June 3, 2011.
  12. By Barbara Quint, Newsbreak. “Elsevier and NetBase Launch illumin8.” February 28, 2008.
  13. By Neil Glassman, Social Times. “What Every Social Media Marketer Should Know About NetBase.” August 24, 2010.
  14. By Ryan Flinn, BusinessWeek. “Wanted: Social Media Sifters.” October 21, 2010.
  15. By David F. Carr, InformationWeek. “How Kraft Foods Listens to Social Media.” June 30, 2011.
  16. By Ryan Flinn, Bloomberg. “Tech companies measure online sentiment.” May 19, 2011.
  17. By Geoffrey Fowler and Alexandra Berzon, Wall Street Journal. “Social Media Buzzes, Comes Into Its Own.” May 2, 2011.

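【One parse a day: possessed, the parser seems to have gone mad】

Me:
System debugging is addictive too. Sleepless tonight, and as I debugged on and on, the parser seemed to go mad. Is it throwing a tantrum because I feed it anything and everything??

[Image: 0927a]
On a closer look, nothing seems badly wrong; it is not mad. Unlike in Lu Xun's 【Diary of a Madman】, my suspicion was unfounded.

Any coordinate (Conj) structure in natural language must be split into separate items at the logical level. When several coordinations co-occur, things get lively: the relations tend toward combinatorial explosion. The Chinese enumeration comma (顿号, "、") is to blame. Why use so many of them? Would it kill anyone to write a few more clauses? Purely syntactic parsing ignores all this, so its graphs look clean. But the semantic computation of deep parsing is logical and cannot ignore it.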
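
The combinatorial growth described above can be sketched in a few lines. This is only an illustration under assumptions: the function name `distribute`, the flat (S, V, O) triple format, and the example words are made up here and are not the parser's actual representation.

```python
from itertools import product

def distribute(subjects, predicate, objects):
    """Expand coordinated subjects and objects into one (S, V, O) triple
    per combination, as a logical-level representation would require."""
    return [(s, predicate, o) for s, o in product(subjects, objects)]

# Hypothetical sentence with 2 coordinated subjects and 3 coordinated objects.
triples = distribute(["张三", "李四"], "喜欢", ["苹果", "香蕉", "橘子"])
for triple in triples:
    print(triple)
print(len(triples))  # 2 subjects x 3 objects
```

Each additional coordinated slot multiplies the number of relations, which is why a graph that looks tidy at the purely syntactic level can sprawl once the coordinations are split.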

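Bai:
"或" (or) binds more weakly than "与" (and); when an enumeration comma is not captured by "或", it is interpreted as "与" by default.

Me:
Strange things keep happening these days. I do not know whether the machine is possessed or the person playing with the machine is; in any case, some weird graphs are coming out, far from the impression left by the syntactic tree diagrams in textbooks. Textbook trees all look like this, overly elegant:

[Image: parse_tree_1]

A couple of days ago a gourd-shaped graph came out, yesterday a double-umbrella one, today it is a rampage, and who knows what tomorrow will bring.

Here are yesterday's two umbrellas. On a look, nothing seems wrong:

[Image: 0926a]

Bai:
The position of "吗" is wrong. In the two-umbrella one, "能……吗" (can ... ?) is the real pair.

Me:
Right, "吗" should attach one level higher. If there were no higher level, "吗" would arguably be right where it is. Climbing a level for one small word is hardly worth it, though climbing (patching) is not impossible. Of course this actually involves deciding what the yes-no question belongs to, so in the end it may still have to go up.

If one says "电子签证是什么吗。" ("what an e-visa is" followed by the particle 吗), that is a creative use: interrogative on the surface, but perhaps really exclamatory? It is not the standard use of "吗", because "吗" by nature marks a yes-no question while "什么" (what) is the wh-word of a wh-question, and the two do not go together.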

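Bai:
That one is "嘛", not "吗".

Me:
Are you sure "吗" cannot be used here?

Bai:
"他知道电子签证是什么" (He knows what an e-visa is)

Me:
It feels acceptable, and it does not seem equivalent to "嘛" either.

"是那个什么吗。" (It's that what's-it-called, isn't it.)
I really have forgotten what that what's-it-called is.

Bai:
The exclamatory sense you mention calls for "嘛". The forgetting sense can take "吗".
But with wrong characters so widely used nowadays, it has long been a mess.

Me:
Here is the gourd from the day before yesterday, Teacher Bai's famous sentence. "与之" (with her) did not get attached as an arg, which is barely satisfactory, but the overall logical-semantic computation is right. "你" (you) (S) and "女人" (woman) (S) got married, and this event modifies (Mod-S: relative clause) "女人".

[Image: 0925a]

Isn't the machine amazing, isn't the parser fun, and doesn't this count as a stepping stone toward machine understanding of human language: open sesame! Sesame, sesame, open up.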

 

【相关】

【立委科普:语法结构树之美】

【立委科普:语法结构树之美(之二)】

中文处理

Parsing

【置顶:立委NLP博文一览】

《朝华午拾》总目录

 

Who we are. Not an ad, but a snapshot.

NetBase

WHO WE ARE

EMPOWERING GLOBAL BUSINESSES WITH SOCIAL INSIGHTS

We are uniquely positioned to help global businesses create real business value from the unprecedented level of growth opportunities presented each day by social media. We have the industry’s fastest and most accurate social analytics platform, strong partnerships with companies like Twitter, DataSift, and Tumblr, and award-winning patented language technology.

We empower brands and agencies to make the smartest business decisions grounded on the deepest and most reliable consumer insights from social. We’ve grown 300 percent year-over-year and are excited to see revenue grow by 4,000% since the second quarter of 2012.

RECENT ACCOLADES

We were recently named a top rated social media management platform by software users on TrustRadius and a market leader by G2 Crowd.

“NetBase is one of the strongest global social listening and analytics tools in the market. Their new interface makes customized dashboard creation a breeze.”

- Omri Duek, Coca-Cola

“Data reporting is both broad and detailed, with the ability to drill down from annual data to hourly data. NetBase allows us to have a pulse on the marketplace in just a few minutes.”

- Susie Thomas, VP, Palisades Media Group

“We started with a gen one solution, but then found that we needed to move to a tool with a better accuracy that could support digital strategy and insights research. NetBase satisfied all our needs.”

- Jared Degnan, Director of Digital Strategy

“As one of the first brands to test NetBase Audience 3D for our Mobile App launch, we’ve found that we could engage with our consumers on a deeper, more human level that further drives them to be brand champions.”

- Mihir Minawala, Manager of Social, Industry & Competitive Intelligence, Taco Bell

OUR CUSTOMERS

We work with executives from forward-looking agencies and leading brands across all verticals in over 99 countries. Our customers use NetBase for real-time consumer insights across the organization, from brand and digital marketing, public relations, product management to customer care.

KEY MILESTONES

  • March 2003
    Founded by Michael Osofsky at MIT. Later joined by Wei Li, Chief NetBase Scientist
  • July 2009
    P&G, Coca-Cola and Kraft signed as first customers of NetBase
  • January 2014
    Named Best-in-Class By Consumer Goods Technology
  • April 2014
    Launched Brand Live Pulse, the first real-time view of brands’ social movements
  • May 2014
    Celebrated 10 years with 500% customer growth in 3 years
  • January 2015
    AdAge Names 5 NetBase Customers to the Agency A-List
  • March 2015
    Introduced Audience 3D, the first ever 3D view of audiences
  • April 2015
    Raised $33 MM in Series E Round
  • November 2015
    Named Market Leader by G2 Crowd. Earned Top Ratings by Trust Radius

What inspired you to join NetBase?

It was exciting to build the technology that could quickly surface meaningful customer insights at scale. For example, what used to take a day to run a simple analysis now takes just a second. Our platform now analyzes data in “Google time”, yet the depth and breadth of our analysis is exponentially deeper and larger than what you’ll ever get from a Google search.

What are you most proud of at NetBase?

I’m especially proud that we have the industry’s most accurate, deepest, fastest, and most granular text analysis technology. This enables us to give our customers very actionable insights, unlike other platforms that offer broad sentiment analysis and general trending topics. Plus, NetBase reads 42 languages. Other platforms don’t even come close. We are customer-centric. Our platform truly helps customers quickly identify their priorities and next steps. This is what sets us apart.

What is the next frontier for NetBase?

With the exploding growth of social and mobile data and new social networks emerging, we’ll be working on connecting all these data points to help our customers get even more out of social data. As Chief Scientist, I’m more excited than ever to develop a “recipe” that can work with the world’s languages and further expand our language offerings.

WE’RE GLOBAL: 42 LANGUAGES, 99+ COUNTRIES, 8 OFFICES

NetBase Solutions, Inc  © 2016

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP

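【One parse a day: it is not impossible for a parser to surpass the person who created it】

Bai:
"那些林彪说过的话" (those words that Lin Biao said)
Let's see how the plural demonstrative (det) skips over the singular NP to find its own head.

Me:

[Image: 0924a]

[Image: 0924b]
What's so hard about that?

[Image: 0924c]

Watching this last sentence come out, I could not help feeling some trepidation: at this rate, it is not impossible for machines to surpass the people who build them. Insiders need no explanation, but let me do some popular science for newcomers today: why do I say the deep parsing of this sentence is good enough to match a linguistics expert and already beyond an ordinary person's ability to analyze language structure? This automatically generated, seemingly simple tree diagram covers this much linguistics:

(1) The plural demonstrative "那批" (that batch of) skips the nearby "你" (you), and even skips the relative-clause predicate "写-过" (have written), to link to the distant head "文章" (articles) as its modifier (Mod). Impressive or not?

(2) It identifies the relative clause (Mod-S) "你写过的" (that you have written) and its head "文章";

(3) It identifies the subject (S) "你" and the logical object (O) "文章" of the relative-clause predicate "写过" (the decoding of the so-called argument structure);

(4) The long-distance verb-object relation (O) between the sentence-initial noun phrase containing the relative clause ("......文章") and the later predicate "保存-着" (is keeping) is also revealed, which is rather impressive too;

(5) In fact, the subject (S), predicate and object (O) of the main clause are all in place, and the small function words are attached where they belong (X).

From the logical-semantic standpoint of deep structural analysis, the analysis above can be called near perfect.

End of popular science.

Having reached this level of automatic linguistic deep parsing for our Chinese sentences, a little showing off is perhaps a forgivable weakness.

End of showing off.

Time to drop the act, dust myself off, and go back to modestly and prudently moving mountains like the Foolish Old Man.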

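Bai:
The Next links in this last sentence are somewhat redundant.
Even if they are removed, all the useful relations remain.

Me:
Next is a bridge (a stepping stone). It could have been thrown away after use, but later I felt it could just as well stay.
As a souvenir of youth.
"Youth" is a commendatory word and "acting like a hooligan" a derogatory one, but they are the same thing: blind restlessness. (Next retains a bit of word-order information. Although word order has no place in logic, this trace may sometimes be of some use when grounding the semantics.)

I have always believed that in structural analysis, machines reaching or surpassing human level is in sight.
Semantic grounding after structural analysis is still some distance from human intelligence. But because semantic grounding is almost always oriented to a domain or application, there is leverage: some problems that seem monumental dissolve naturally, or get simplified, within a domain's use. Seen this way, NLU (or semantic computation) is a tractable monster.

In the past two months there were two cases of using an ox-cleaver to kill a chicken, one in English and one in Chinese. I am not allowed to give details, but I can speak in veiled terms. Both were natural-language problems regarded as roadblocks in some product area. After looking into them, my answer was: with the nuclear weapon of deep parsing, what is hard about this?

A trial run showed it really was an ox-cleaver on a chicken; the bottom was visible at a glance. Many people think the nuclear-weapon talk is wild exaggeration on Liwei's part. Heaven knows it really is not. The party concerned said that once this problem is solved in that product area, many follow-on applications open up. Still, unless forced to, I would rather use the ox-cleaver on oxen than get stuck in the henhouse killing chickens without end; there is no glory in such wins. Doesn't the old saying go, do not bow for five pecks of rice? I hope it never comes down to the five pecks of rice.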

【相关】

中文处理

Parsing

【置顶:立委NLP博文一览】

《朝华午拾》总目录

 

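【Li-Bai Dialogues: how to learn and handle "打了一拳" (threw a punch)】

Bai:
"张三打了李四一拳" (Zhang San gave Li Si a punch); "张三打李四的那一拳" (that punch with which Zhang San hit Li Si)
My questions: 1. Is the "logical-semantic relation" between "一拳" (one punch) and "打" (hit) the same in the two examples?
2. If so, is this relation one of carrot and slot (filler and slot)?
3. If it is, is the slot inherent to "打", or is it forced out by the appearance of "一拳"?
4. Are slots that are not inherent but can be forced out an isolated phenomenon or a general one? Are they specific to Chinese or a language universal?
2'. If not the same, how is the relation between the relative clause and the head "那一拳" in the second example established?
"张三喊了一嗓子" (Zhang San gave a shout); "张三喊的那一嗓子,我老远就听见了" (that shout Zhang San gave, I heard it from far away): the same reasoning applies.
Also, for fixed phrases where an instrument is extended to a move, such as "回马枪" (a backward spear thrust) and "窝心脚" (a kick to the chest), can the classifier simply be dropped so that they combine directly with a numeral?

Me:

1. They should be the same in logical semantics; syntactically there is the typical difference between 【subject-predicate】 and 【relative clause + NP】.

2. Specifically, "打一拳" is a collocation, a compound verb comparable to "洗澡" (take a bath), except that the latter is a verb-object collocation and the former a verb-complement one. Both are syntactic manifestations of compounds, and both involve the dynamic interface between lexicon and syntax.
A collocation of literals of course counts as carrot and slot.
The carrots and slots in language are nothing but: (1) a literal (word) provides a slot for a class of words (feature); (2) a literal (word) provides a slot for another literal (word), usually called strong collocation; (3) a class of words (feature) provides a slot for another class of words (feature). (3) is how conventional syntax works: abstract on both sides, with neither end grounded in a word. Its rules (feature based grammar) generalize well but easily meet their Waterloo in exceptions. Lexicalized grammar, or word driven rules, move further and further from (3), or confine (3) to a very small number of rules. That leaves (1) and (2).
"打...一拳" is (2), which brings us to your third question: in a collocation of two literals, who expects whom?
Purely technically, there is no distinction at all; the two are equivalent. x and y hook up with each other; whether you say x hooked y or y hooked x makes no difference. They are one family anyway, originally one word and one concept, merely pulled apart in linguistic expression.

【3. If it is, is the slot inherent to "打", or is it forced out by the appearance of "一拳"?】
"打一拳" is one lexical entry, conceptually a single whole with no primary and secondary (the head-complement ranking is internal to morphology and can be ignored). Operationally, though, there are options. (I do not know whether Chinese collocation dictionaries place an entry like "打一拳" under "打", under "一拳", or under both.) In an NLP implementation, however, "打一拳", like "洗澡", is a particular separable-word entry. Only the labels differ, e.g. Vo versus Vbu; the rest is left to syntax.

【4. Are slots that are not inherent but can be forced out an isolated phenomenon or a general one? Are they specific to Chinese or a language universal?】
For collocations of literals, my view is that there is no question of inherent versus forced; it is mutual attraction by mutual consent.
This should be a general phenomenon: x--y. Chinese has "洗-澡", English has "take--bath". Collocations of literal with literal whose morphology is verb-complement or verb-adverbial surely exist in other languages too; I just cannot think of examples at the moment.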

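Bai:
"打一苕帚疙瘩" (give a whack with a broom stub) is then a collocation too.
Anything within reach can be grabbed and used to hit.
The collocation approach is rather too ad hoc.

Me:
All lexicons are ad hoc; otherwise it would not be called kidnapping (hard-coding). But the x--y collocation behind the entries is a language universal.

Bai:
The problem is that it cannot be exhaustively listed, and it is productive to begin with, a regular phenomenon: "打两鞭子" (give two lashes), "砍三刀" (chop three times with a knife), "踹五脚" (kick five times).

Me:
If it cannot be exhaustively listed, then it is not x--y strong collocation. In theory, what is not x --- y can only be x ---- feature or feature1 ----- feature2; there is no other box to put it in.
Is "砍三刀" comparable to "洗三个澡" (take three baths)? If so, it is x --- y; only the numeral varies, and the two ends are fixed: "踹-脚", "砍--刀".

Bai:
Those with a classifier do not count; only those that omit the classifier do. It is clearly an instrument, but one can hardly say the verb itself comes with an "instrument" slot.

Me:
There are phenomena in the middle ground.
Ultimately it is a question of approach. On a lexicalist approach, everything in the middle ground goes into the lexicon, with no worry about being ad hoc or redundant; the benefit is precision. With a "traditional" grammar, the middle ground is assigned to syntax with full productivity; the benefit is good recall, but exceptions easily cause trouble and precision suffers. Of course the two can be combined: first set up a backstop rule for recall, then, when middle-ground cases turn out wrong, block them via the lexicon. A conceivable backstop rule for recall looks like this, exploiting the syntactic fact that Chinese nouns usually cannot be modified directly by a numeral:

V + CD + N --> V Buyu

This rule catches a lot, but it is dangerous. Patching can reduce the danger but not eliminate it, because that is the nature of a feature based rule (POS is a feature).

Continuing the exercise, we can have a backstop rule to capture the generality of the kind of phenomenon Teacher Bai mentioned:

V + (aspect particle) + CD + N ==> V <-- Buyu[CD+N]

This rule can parse all the phenomena listed above, but it is still too "powerful": ample recall, insufficient precision. As for precision, in engineering it comes down to continually expanding the tests. If the tests go well, treat the rule as having no precision problem; if the tests hit problems, there are three ways out: (1) refine within this rule, narrowing the POS condition to subclasses or ontology, or adding other constraints; (2) write another fine-grained rule to override it, making the grammar a hierarchy of modules; (3) throw the errors (exceptions) into the lexicon, which is really the limiting case of (2), treating the lexicon as the extreme end of the rule hierarchy. With such a hierarchical rule design, from lexical rules through fine-grained feature rules to abstract POS-level rules, one can handle exceptions, idiosyncrasies, all the way to generalities and the gray areas in between.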
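
As a rough sketch of how such a backstop rule with a lexical override might look in code, the snippet below matches V + (aspect particle) + CD + N over POS-tagged tokens. Everything here is an assumption for illustration: the tag names (`V`, `AS`, `CD`, `N`), the function name `find_buyu`, and the `blocked_nouns` exception list (standing in for way (3), the lexicon) are hypothetical and are not the actual rule formalism discussed in the dialogue.

```python
def find_buyu(tokens, blocked_nouns=frozenset()):
    """Return (verb, numeral, noun) for every V (AS)? CD N match.

    tokens: list of (word, pos) pairs.
    blocked_nouns: hypothetical lexical exceptions that override the rule.
    """
    matches = []
    i = 0
    while i < len(tokens):
        if tokens[i][1] == "V":
            j = i + 1
            # Optional aspect particle such as 了 or 过.
            if j < len(tokens) and tokens[j][1] == "AS":
                j += 1
            if (j + 1 < len(tokens)
                    and tokens[j][1] == "CD"
                    and tokens[j + 1][1] == "N"
                    and tokens[j + 1][0] not in blocked_nouns):
                matches.append((tokens[i][0], tokens[j][0], tokens[j + 1][0]))
                i = j + 2
                continue
        i += 1
    return matches

chop = [("张三", "NR"), ("砍", "V"), ("了", "AS"), ("三", "CD"), ("刀", "N")]
kick = [("踹", "V"), ("五", "CD"), ("脚", "N")]
came = [("来", "V"), ("了", "AS"), ("三", "CD"), ("人", "N")]

print(find_buyu(chop))
print(find_buyu(kick))
# 人 can take a numeral directly, so 三人 is not a complement;
# listing it as an exception blocks the over-general rule.
print(find_buyu(came, blocked_nouns={"人"}))
```

The general pattern supplies recall, while the exception set grows only as tests reveal errors, which mirrors the "backstop first, patch later" workflow described above.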

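Too lazy for big data, too lazy even to hard-code collocations in the lexicon: I will just put the default rule above into the system and make do for now, and wait for exceptions to show up over time.

[Image: 0925b]

[Image: 0925c]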

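Bai:
Why insist on rules at the fine-grained level?
The weakness of rules at the level discussed here is exactly where learning has the advantage.

Me:
We can do without fine granularity: grasp the two ends and the middle follows. At worst there is some redundancy, with everything gray treated as black. "Cannot be exhaustively listed" is just a figure of speech. Statistically, what lies in the gray area can certainly be enumerated; it is just that the enumeration eventually turns into a statistical long tail and one stops listing.

Bai:
What I mean is that there is no dichotomy here of either lexical binding or rules; it can be learning-based.

Me:
Could Teacher Bai illustrate where the advantage of the learning-based approach lies? (Actually I do not see this problem as a challenge to rule systems, nor as more challenging than "洗澡".)

Bai:
When things cannot be enumerated and the rules are messy, that is just the case for learning from a partial set of examples. Features are valuable and long-tail instances are valuable too; learning them bundled together is the right way, with both generalization and rote memorization.

Bai:
Rote-memorizing what is regular is forcing a good kid to act like a hooligan.

Me:
Seen charitably, it is also teaching the kid to be down-to-earth, one step at a time.

Bai:
In the gray area between generalization and rote memorization, use learning where learning is called for.
It looks unpleasant, and it is not as if there were no alternative.
Only exam-oriented cramming kills everything that is alive.

Me:
The root of the matter is that, to date, a system is either statistical or rule-based. So-called hybrid systems are mostly a stacking of two systems rather than a fusion. In that context, it is not a matter of doing rules where rules fit and lexicon where lexicon fits, with some statistical learning mixed in between, although the latter should be a research direction and could probably do better than stacked hybrids. If Teacher Bai means a purely learning-based system, that is another frame of discourse: no comment. From the rule side, grasping the two ends and treating gray as black is no problem; it just takes time. The general rule guarantees recall, and precision is a function of time.

Bai:
What I mean is: use rules for who may combine with whom; among candidates that equally satisfy the rules, use learning for who is excluded from combining with whom. But this is unsupervised learning, with annotation coming from the lexicon. The rule part involves only carrots, slots and hats, not subcat; the learning part uses subcat.

Me:
In fact, just taking the simple pattern V+CD+N to massive data, what comes back is more or less all that unsupervised learning needs. This is a very narrow linguistic phenomenon. The result of the unsupervised learning is knowledge acquisition for this particular subcat, which is an offline learning process. The learned result is then used to support parsing.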

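Bai:
Actually this thread has drifted. My intent was to explore non-default slots that are forced out.
If that can be done, it may be closer to the essence of language.

"他上学的那个学校" (the school where he studies); "他约会的那个晚上" (the evening when he had his date).

Even without a numeral, there are cases where a noun serving as adverbial or complement in one construction serves as subject or predicate in another related construction, while the logical-semantic relation stays the same. The noun's real identity is a role such as instrument, location, or time, which is not a default for the verb. Once it lands in a certain position, it forces the verb to make this role a default.
English preposition stranding, as in "the man you look for", gives them a clear identity: even in a relative clause they are of collateral birth (raised by the preposition, not by the verb). Of course one can say they are raised by the verb-preposition combination "look for".
In Chinese, once inside a relative clause, one cannot tell who raised them; the preposition simply disappears, and keeping it would be wrong. To keep it, the zero form must be replaced by an overt pronoun: "你在其中上学的学校" (the school in which you study), "你与之结婚的女人" (the woman with whom you got married).

Adding a numeral merely highlights the verbal-quantity meaning; it does not change the logical-semantic relation.

"砍张三的斧子" (the axe that chopped Zhang San) ... focus on the instrument
"砍张三的两斧子" (the two axe blows that chopped Zhang San) ... focus on the number of actions
"砍张三的斧子" ... "用来(以/之/其)砍张三的斧子" (the axe used to chop Zhang San)

Me:
A complement expressing frequency is a "bleached" (and at the same time "vivid") linguistic use of the logical-semantic instrument. This bleached use is not itself a language universal, but it maps to the deep logical-semantic 【instrument】, and 【instrument】 is universal. For "砍" (chop), 【instrument】 is not a forced default but an inherent one; check Teacher Dong's HowNet if in doubt. The default for 结婚 (marry) is with [human]. For 上学 (attend school), is 学校 (school) inherent? One could probably say so; I do not know whether HowNet gives 上学 a location slot whose default is a school.

Try finding a completely random attributive or adverbial; it does not seem to work. It seems hard to find a case that has the same logical semantics and can participate in both constructions: the complement construction (expressing frequency) and the attributive construction. In other words, this phenomenon is either collocation or an extension of collocation, not the combination of a random modifier (adjunct), nor a complement forced out of an adjunct. The logical semantics inside is some argument in a conceptual relation, and the combination has its necessity. Such collocation may be word to word (both legs on the ground) or word to subclass (feature: one leg on the ground). The former is strong collocation hard-coded in the lexicon; the latter is gray, not necessarily capturable by hard-coding, but learnable by statistics.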

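Bai:
Exactly what I wanted to say.

Me:
Teacher Bai is doing far more than moving a thousand pounds with four ounces, lol.

Acquisition of word-to-subclass subcat, e.g. a verb requiring a certain kind of object (say 【human】), can be learned from big data; the idea has been around for a while. A professor at Cambridge University advocated this kind of learning years ago, and I recall he ran a batch of experiments and published some papers. But overall these studies were sporadic; research went one way and applications another, and there seem to be no impressive results combining the two.

Bai:
Without anchoring collocation learning on structure, it goes nowhere.
If you learn both structure and collocation at once, it is bound to be a mess.
It must be: fix a small number of possible structures and let collocation discriminate further, each doing its own job.

Bai:
For "砍" the instrument can be a default; for "打" it cannot. The subcat suited to "打" is very untidy: what we have in mind is "objects that can be conveniently grabbed", but the subcat list will not hand you that neatly. So we have to treat the many subcats and many word instances all as features, and find a way to learn from examples that can be listed (including word-instance-to-subcat subrules already confirmed).
A stove is too big to grab. A house is even bigger. A broom is the right size. Bacteria are too small. So "张三打李四一大肠杆菌" (Zhang San hit Li Si one E. coli) does not work.

Me:
With the pattern 打+CD+N, learning hits the mark every time. As long as there is massive data, noise is nothing to fear, because this pattern works very well.
This reminds me that more than 10 years ago someone at Google published a paper that learned an ISA taxonomy with two specially chosen ngram patterns, which was impressive. We later replicated this work; although we never really used the results, the approach is right. Following a similar learning approach, HowNet too could one day be learned, provided Teacher Dong defines the nature of the semantic relations to learn and suitable patterns are found.
The two patterns Google used are: "N such as X, Y, Z" and "X, Y, Z and other N".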

e.g.
furniture such as desks, chairs, coffee-tables
desks, chairs, coffee-tables and other furniture (will all be on sale)
taxonomy is: {X, Y, Z} -->N
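
The two patterns can be tried out with a pair of regular expressions. This is a minimal sketch only: the regexes and the function name `isa_pairs` are assumptions made for this example, they handle just the comma-separated form shown above, and they are not the implementation from the paper mentioned.

```python
import re

# "N such as X, Y, Z"
SUCH_AS = re.compile(r"(\w+) such as ((?:[\w-]+, )*[\w-]+)")
# "X, Y, Z and other N"
AND_OTHER = re.compile(r"((?:[\w-]+, )*[\w-]+) and other (\w+)")

def isa_pairs(text):
    """Return (hyponym, hypernym) pairs found by the two patterns."""
    pairs = []
    for m in SUCH_AS.finditer(text):
        pairs += [(x, m.group(1)) for x in m.group(2).split(", ")]
    for m in AND_OTHER.finditer(text):
        pairs += [(x, m.group(2)) for x in m.group(1).split(", ")]
    return pairs

print(isa_pairs("furniture such as desks, chairs, coffee-tables"))
print(isa_pairs("desks, chairs, coffee-tables and other furniture (will all be on sale)"))
```

On real text the patterns would produce noise, and the argument in the dialogue is that counting over massive data lets the reliable pairs stand out.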

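What is the use of learning it, when people can think it all up bit by bit anyway? HowNet is rich in semantic relations and so took many years to compile, but it did get compiled and is nearly complete (Teacher Dong now seems to make only occasional additions). If experts can compile it by hand, both complete and well crafted, why count on big data to acquire this knowledge? That is one side of the issue; especially for relatively stable, long-lasting conceptual semantic relations, there is indeed no reason not to use the experts' product.

The other side is that for conceptual relations with some fluidity, experts can hardly keep up with machine acquisition, and then there is knowledge of different domains, and so on. This is where human effort falls short and one can only rely on big data and machines. The Google paper above gave some interesting examples; I recall that the learned hyponyms of "dictator" had members very characteristic of big data: Castro, Mao Zedong, Stalin, Hitler, etc.

Bai:
That is subjective classification, not suitable for a lexicon. And instances of "知名品牌" (well-known brands) would have commercial value right away.

Me:
Isn't that what I do every day: social media mining of public opinions and sentiments.
Our company regularly publishes finely printed reputation rankings of well-known global brands and the like. Earlier editions covered luxury brands (designer bags, luxury cars, premium perfumes). The latest issue is: Social Media Industry Report 2016: Restaurant Brand

I just tested Teacher Bai's example sentences; the oddest one is this:

[Image: 0925a]

A tree diagram grown into the shape of a gourd; I had really never seen one before. (The lexicon has no small word "与之", and PP did not compose it either, so it was dropped.) Even so, the whole graph is quite logical, by whatever stroke of luck: "你" is one party to the marriage (S), "女人" is the other party (S), and the event of the two marrying is a relative clause (Mod-S) modifying "女人". As for the small words "的" and "之", and the groping hand of Next, they are all stepping stones that help build the structure. These surface items are irrelevant to logical semantics; they stay not to be eyesores but in case some help from surface traces is needed when the semantics is grounded in use. After all, the goal of semantic computation is not to draw good-looking logical graphs for our own amusement, but to ground them and build products.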

 

【相关】

中文处理

Parsing

【置顶:立委NLP博文一览】

《朝华午拾》总目录

 

Chart Parsing Chinese Character Strings

W. Li. 1997. Chart Parsing Chinese Character Strings. In
Proceedings of the Ninth North American Conference on Chinese
Linguistics (NACCL-9). Victoria, Canada.

Chart Parsing Chinese Character Strings [1]

 

Wei  LI

Simon Fraser University
Burnaby B.C. V5A 1S6 CANADA ([email protected]) 

 

ABSTRACT

This paper examines problems in word identification for a Chinese natural language processing system and presents our solution to these problems. In conventional systems, written Chinese parsing takes two steps: (1) a segmentation preprocessor for word identification (segmenter); (2) a grammar parsing the string of identified words. Morphological analysis, when required, as in the case of productive word formation, has to be incorporated in the segmenter. This matches the conventional morphology-before-syntax architecture. We will demonstrate the theoretical defect of this architecture when applied to Chinese. This leads to the conclusion that segmentational approach, despite its being the mainstream in Chinese computational morphology, is in general not adequate for the task of Chinese word identification. To solve this problem, a full grammar should be made available. Therefore, we take an alternative one-step approach. We have implemented an integrated grammar of morphology and syntax for directly parsing a string of Chinese characters, building both morphological and syntactic structures. Compared with the conventional two-step approach, our strategy has advantages in resolving ambiguity in word identification and in handling productive word formation.

  1. Introduction

A written Chinese sentence is a string of characters with no blanks to mark word boundaries. In conventional systems, Chinese parsing takes two steps, as shown in Figure 1 below: (1) a segmentation preprocessor (called a segmenter) for word identification; (2) a word-based parsing grammar, building syntactic structures (Feng 1996; Chen & Liu 1992).

[Figure 1: the conventional two-step architecture]

 

In contrast, we take an alternative one-step approach, as shown in Figure 2 below. We have implemented a grammar named W‑CPSG (for Wei's Chinese Phrase Structure Grammar). W‑CPSG integrates morphology and syntax for character based parsing, building both morphological and syntactic structures.

[Figure 2: the W-CPSG one-step architecture]

In the two-step architecture, the purpose for the segmenter is to properly identify a string of words to feed syntax. This is not an easy task due to the possible involvement of the segmentation ambiguity. For example, given a string of 4 Chinese characters 研究生命, the segmentation ambiguity is shown in (1.a) and (1.b) below.

(1.)  研究生命

(a)    研究生 | 命
       graduate student | life or destiny

(b)    研究 | 生命
       study | life

The resolution of the above ambiguity in the segmenter is a hopeless job because such ambiguity is syntactically conditioned. For sentences like 研究生命金贵 (life for graduate students is precious), (1.a) is the right identification. For the phrase 研究生命起源 (to study the origin of life), (1.b) is right. So far there are no segmenters which can handle this properly and guarantee right word segmentation (Feng 1996). In fact, there can never be such segmenters as long as a grammar is not brought in. This is a theoretical defect of all Chinese analysis systems in the conventional architecture. We have solved this problem in our morphology-syntax integrated W‑CPSG. Word identification in our design becomes a by-product of parsing instead of a pre-condition for parsing.

In the text below, Section 2 investigates problems with the conventional two-step approach. In Section 3, we will present W‑CPSG one-step approach and demonstrate how W‑CPSG parsing solves these problems. The following is a list for abbreviations used in this paper.

A (Adjective); AF (Affix); BM (Bound Morpheme);
CLA (Classifier); CLAP (Classifier Phrase);
DE (Chinese particle introducing a modifier of noun); DEP (DE Phrase);
DE3 (Chinese particle introducing a modifier of result or capability);
DET (Determiner); LE (Chinese perfective aspect marker);
N (Noun); NP (Noun Phrase); P (Preposition); PP (Prepositional Phrase);
S (Sentence); V (Verb); VP (Verb Phrase); Vt (Transitive Verb)

  2. Problems Challenging Segmenters

In general, there are two basic problems for segmenters, namely, segmentation ambiguity and productive word formation.

2.1. segmentation ambiguity

This sub-section studies the segmentation ambiguity for Chinese word identification. We indicate that this ambiguity is structural in nature. Therefore it should be captured by structural trees via parsing. We conclude that a parsing grammar is indispensable in the resolution of the segmentation ambiguity.

Behind all segmenters are procedure based segmentation algorithms. Most proposals are some modified versions of large-lexicon based matching algorithms. As an underlying hypothesis, a longer match overrides a shorter match, hence the name maximum match. Decided by the  direction of the  procedure, i.e. whether  the segmentation proceeds from left (the beginning of a string) to right (the end of the string) or from right to left, we have two general types of maximum match: (1) FMM (Forward Maximum Match) algorithm; (2) BMM (Backward Maximum Match) algorithm (Feng 1996).

According to Liang 1987, segmenters have trouble with cases involving the segmentation ambiguity. There are two types of segmentation ambiguity: the cross ambiguity (AB|C vs. A|BC) and the embedded ambiguity (AB vs. A|B).

To detect possible ambiguity, many researchers use the technique of combining the FMM algorithm and the BMM algorithm. When the output of FMM and BMM are different, there must be some ambiguity involved. The following table lists the cases associated with the FMM and BMM combined approach.[2]
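
As a rough illustration of the combined approach, the sketch below implements FMM and BMM over a toy lexicon and flags a string as ambiguous when the two outputs differ. The lexicon contents and the function names are assumptions made for this example and are not taken from any of the systems cited.

```python
# Toy lexicon: an assumption for illustration only.
LEXICON = {"研究", "研究生", "生命", "命", "起源", "他", "会", "烤", "白薯", "烤白薯"}

def fmm(text, lexicon=LEXICON):
    """Forward Maximum Match: scan left to right, taking the longest word."""
    max_len = max(len(w) for w in lexicon)
    words, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            # Fall back to a single character when no lexicon word matches.
            if n == 1 or text[i:i + n] in lexicon:
                words.append(text[i:i + n])
                i += n
                break
    return words

def bmm(text, lexicon=LEXICON):
    """Backward Maximum Match: scan right to left, taking the longest word."""
    max_len = max(len(w) for w in lexicon)
    words, j = [], len(text)
    while j > 0:
        for n in range(min(max_len, j), 0, -1):
            if n == 1 or text[j - n:j] in lexicon:
                words.insert(0, text[j - n:j])
                j -= n
                break
    return words

def detect_ambiguity(text):
    """Report ambiguity only when FMM and BMM disagree."""
    return fmm(text) != bmm(text)

print(fmm("研究生命起源"))
print(bmm("研究生命起源"))
print(detect_ambiguity("研究生命起源"))
print(detect_ambiguity("他会烤白薯"))
```

With this lexicon the two algorithms disagree on 研究生命起源, so the cross ambiguity is detected, but they agree on 他会烤白薯 (both keep 烤白薯 as one word), so an embedded ambiguity passes unnoticed.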

[Table: cases associated with the FMM and BMM combined approach]

The following 3 examples all contain a cross ambiguity sub-string 研究生命 with 2 segmentation possibilities: 研究生|命 and 研究|生命. Example (4.) is a genuinely ambiguous case. Genuinely ambiguous sentences cannot be disambiguated within the sentence boundary, rendering multiple readings.

(2.) case 1:  研究生命金贵。

(a)    研究生 | 命 | 金贵    (FMM: correct)
       graduate student | life | precious
       Life for graduate students is precious.

(b) *  研究 | 生命 | 金贵    (BMM: incorrect)
       study | life | precious

(3.) case 2:  研究生命起源。

(a) *  研究生 | 命 | 起源    (FMM: incorrect)
       graduate-student | life | origin

(b)    研究 | 生命 | 起源    (BMM: correct)
       study | life | origin
       to study the origin of life

(4.) case 3:  研究生命不好。

(a)    研究生 | 命 | 不 | 好    (FMM: correct)
       graduate student | destiny | not | good
       The destiny of graduate students is not good.

(b)    研究 | 生命 | 不 | 好    (BMM: correct)
       study | life | not | good
       It is not good to study life.

The following example is a complicated case of cross ambiguity, involving more than 2 ways of segmentation. Both the FMM segmentation 出现|在世|界 and the BMM segmentation 出|现在|世界 are wrong. A third segmentation 出现|在|世界 is right.

(5.) case 4:  出现在世界东方。

(a) *  出现 | 在世 | 界 | 东方    (FMM: incorrect)
       appear | be-alive | BM | east

(b) *  出 | 现在 | 世界 | 东方    (BMM: incorrect)
       out | now | world | east

(c)    出现 | 在 | 世界 | 东方    (correct)
       appear | at | world | east
       to appear in the east of the world

In the following examples (6.) through (8.), 烤白薯 involves embedded ambiguity. As separate words, the verb 烤 (bake) and the NP 白薯 (sweet potato) form a VP. As a whole, it is a compound noun 烤白薯 (baked sweet potato). In cases of the embedded ambiguity, FMM and BMM always make the same segmentation, namely AB instead of A|B. It may be the only right choice, as seen in (6.). It may be wrong as shown in (7.). It may only be half right, as in the case of genuine ambiguity shown in (8.).

(6.) case 5:       他吃烤白薯。

(a)        他       |       | 烤白薯                                 (FMM&BMM: correct)
he       | eat     | baked sweet potato
He eats baked sweet potatoes.

(b) *     他       |       |       | 白薯                        (incorrect)
he       | eat     | bake | sweet potato

(7.) case 6:       他会烤白薯。

(a) *     他       |       | 烤白薯                                 (FMM&BMM: incorrect)
he       | can    | baked sweet potato

(b)        他      |       |       | 白薯                         (correct)
he      | can   | bake | sweet potato
He can bake sweet potatoes.

(8.) case 7:       他喜欢烤白薯。

(a)       他       | 喜欢 | 烤白薯                                  (FMM&BMM: correct)
he      | like  | baked sweet potato
He likes baked sweet potatoes.

(b)        他       | 喜欢   | 烤    | 白薯                       (correct)
he      | like     | bake | sweet potato
He likes baking sweet potatoes.

Comparing the above examples, we see that there are severe limitations to the FMM-BMM combined approach. First, it only serves the purpose of ambiguity detection (when the results of FMM and BMM do not match) and contributes nothing to resolution. It has no way to tell which segmentation is right (compare case 1 and case 2) or, worse still, whether both are right (case 3) or both are wrong (case 4). Second, even when the results of FMM and BMM do match, this by no means guarantees the right segmentation (case 6). Third, as far as detection is concerned, the approach is limited to cross ambiguity. Embedded ambiguity is a blind area for this way of detection (case 6 and case 7), because the maximum match hypothesis underlying the FMM and BMM algorithms directly contradicts the phenomenon of embedded ambiguity.
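The detection step and its blind area can be made concrete with a small Python sketch. It is illustrative only: the toy lexicon, the maximum word length of 3 and all function names are assumptions made for this example, not part of any system discussed in this paper.

```python
def fmm(text, lexicon, max_len=3):
    """Forward maximum match: scan left to right, taking the longest lexicon word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or cand in lexicon:
                words.append(cand)
                i += size
                break
    return words


def bmm(text, lexicon, max_len=3):
    """Backward maximum match: scan right to left, taking the longest lexicon word."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            cand = text[j - size:j]
            if size == 1 or cand in lexicon:
                words.insert(0, cand)
                j -= size
                break
    return words


def ambiguity_detected(text, lexicon):
    """The combined approach flags ambiguity only when the two scans disagree."""
    return fmm(text, lexicon) != bmm(text, lexicon)


# Toy lexicon (an assumption for illustration only).
TOY_LEXICON = {"研究", "研究生", "生命", "命", "不", "好",
               "他", "会", "烤", "白薯", "烤白薯"}
```

With this lexicon, the two scans disagree on case 3 (研究生|命|不|好 versus 研究|生命|不|好), so ambiguity is flagged but neither result is preferred. On case 6 both scans return 他|会|烤白薯, so the wrong segmentation goes unnoticed.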

In the face of ambiguity, how do people judge which segmentation is right in the first place? It really depends on whether we can understand the sentence or phrase based on the segmentation. In computational linguistics, this is equivalent to whether the segmented string can be parsed by a grammar. Segmentation ambiguity is one type of structural ambiguity, not different in essence from typical structural ambiguity like, say, PP attachment ambiguity. In fact, the PP attachment problem in English syntax is a counterpart of the cross ambiguity, as shown below.

(9.)       Cross ambiguity in PP attachment: V NP PP

(a) [V NP] [PP]
(b) [V] [NP PP]

Therefore, like English PP attachment, Chinese word segmentation ambiguity should also be captured by a parsing grammar. A parser resolves the ambiguity if it can, or detects the ambiguity in the form of multiple parses when it cannot. As shall be demonstrated in Section 3, wrong segmentation will not lead to a parse. Right segmentation results in at least one successful parse. In any case, at least a parser (hence a grammar on which the parser is based) is required for proper word identification.

The important point is that ambiguity in word identification is a grammatical problem. Any attempt to solve this problem without a grammar is bound to be crippled. Since traditional segmentation algorithms are non-grammatical in nature, they are theoretically not equipped to handle such ambiguity. A successive model of segmenter-before-grammar attempts to do what it is not yet able to do. This is the theoretical defect of almost all existing segmentation approaches.

(10.)     Conclusion for 2.1.

The segmentation ambiguity in word identification is one type of structural ambiguity. In order to solve this problem, a parsing grammar is indispensable.

2.2. productive word formation

Unless morphological analysis is incorporated, lexicon match based segmenters will have trouble with new words produced by Chinese productive word formation, including reduplication, derivation and the formation of proper names. When the morphology component is incorporated in the segmenter, the two-step design becomes a variant of the conventional morphology-before-syntax architecture. But this architecture is not effective when the segmentation ambiguity is at issue.

In the following, we investigate reduplication, derivation and proper names one by one. In each case, we find that there is always a possible involvement of the segmentation ambiguity. This problem cannot be solved by a morphology component independent of syntax. We therefore propose a  grammar incorporating both morphology and syntax.

2.2.1. reduplication

Reduplication in Chinese serves various grammatical and/or lexical functions. Not all reduplications pose challenges to segmentation algorithms. Assume that a word consists of 2 characters AB, reduplication of the type AB --> ABAB is no problem. What becomes a problem for word segmentation is the reduplication of the type AB --> AABB or its variants like AB --> AAB. For example, a two-morpheme verb with verb-object relation at the level of morphology has the following way of reduplication.

(11.) Verb Reduplication: AB --> AAB  (for diminutive use)

分心 (get distracted) --> 分分心 (get distracted a bit)

让他分分心。

让       | 他     | 分分心
let       | he    | get distracted a bit
Let him relax a while.

It seems that reduplication is a simple process which can be handled by incorporating some procedure-based function calls in the segmentation algorithm. If a 3-character string, say 分分心, cannot be found in the lexicon, the reduplication procedure will check whether the first 2 characters are the same and, if so, delete one of them and consult the lexicon again. But such an extension of the segmentation algorithm is powerless when segmentation ambiguity is involved. For example, it is wrong to regard 分分心 as a case of reduplication in the following sentence.

(12.)   这件事十分分心。

(a) *     这       | 件   | 事      | 十     | 分分心
this      | CLA  | thing  | ten    | get distracted a bit

(b)        这       | 件    | 事      | 十分    | 分心
this      | CLA  | thing  | very   | distracting
This thing is very distracting.
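The procedural check described above, and the way it misfires in (12.), can be sketched as follows. This is a hedged illustration: the toy lexicon and the function names `aab_base` and `aab_hits` are assumptions introduced for this example.

```python
def aab_base(candidate, lexicon):
    """Return the base word AB if candidate has the shape AAB and AB is in the lexicon."""
    if (len(candidate) == 3 and candidate not in lexicon
            and candidate[0] == candidate[1]):
        base = candidate[1:]  # delete one copy of the repeated character
        if base in lexicon:
            return base
    return None


def aab_hits(text, lexicon):
    """Every (start, end, base) position where the string-level check fires."""
    hits = []
    for start in range(len(text) - 2):
        base = aab_base(text[start:start + 3], lexicon)
        if base is not None:
            hits.append((start, start + 3, base))
    return hits


# Toy lexicon (an assumption for illustration only).
TOY_LEXICON = {"让", "他", "分心", "这", "件", "事", "十", "十分"}
```

The check fires correctly on 让他分分心, but it also fires on the sub-string 分分心 of 这件事十分分心, where the right segmentation is 十分|分心. Nothing at the string level distinguishes the two situations.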

2.2.2. derivation

In Contemporary Mandarin, there have come to be a few morphemes functioning similarly to English affixes, e.g. 可 (-able) turns a transitive verb into an adjective.

(13.)     可 (-able) + Vt --> A

可 (-able) + 读 (Vt: read) -->   可读 (A:readable)

这本书非常可读。

这       | 本     | 书       | 非常   | 可读
this    | CLA  | book  | very  | readable
This book is very readable.

The suffix 性 works just like '-ness',  changing an adjective into an abstract noun.  The derived noun 可读性 (readability) in the following example, similar to its English counterpart, involves a process of double affixation.

(14.)     A + 性 (-ness)  --> N
可 (-able) + 读 (Vt: read) -->   可读 (A:readable)
可读 (A:readable) + 性 (-ness) --> 可读性 (N:readability)

这本书的可读性

这       | 本      | 书       | 的   | 可读性
this    | CLA  | book  | DE    | readability
this book's readability

The suffix 头 can change a transitive verb into an abstract noun, adding to it the meaning "worth-of".

(15.) Vt + 头 (AF:worth of) --> N

吃 (Vt:eat) + 头 (AF:worth of) --> 吃头 (N:worth of eating)

这道菜没有吃头

这       | 道     | 菜      | 没有             | 吃头
this    | CLA  | dish  | not-have    | worth-of-eating
This dish is not worth eating.

It is not difficult to incorporate these derivation rules in the segmenter for morphological analysis. But, as in the case of reduplication, there is always a danger of wrongly applying the rules because of the ambiguity that may be involved. For example, 吃头 is a sub-string with embedded ambiguity. It can be either a derived noun 'worth of eating' or two separate words, as seen in the following example.

(16.)  他饿得能吃头牛。

(a) *     他      | 饿             | 得    | 能   | 吃头                        | 牛
             he     | hungry    | DE3  | can  | worth-of-eating   | ox

(b)        他      | 饿              | 得   | 能   | 吃    | 头    | 牛
              he     | hungry    | DE3  | can  | eat    | CLA  | ox
He is so hungry that he can eat an ox.

2.2.3. proper name

Proper names are of 2 major types: (1) Chinese names; (2) transliterated foreign names. In this paper, we only target the identification of Chinese names and leave the problem of transliterated foreign names for further research (Li, 1997b).

A Chinese human name usually consists of a family name followed by a given name. Chinese family names form a clear-cut closed set. A given name is usually either one character or two characters. For example, the late Chinese chairman 毛泽东 (Mao Zedong) used to have another name 李得胜 (Li Desheng). In the lexicon, 李 is a registered family name. Both 得胜 and 胜 mean 'win'. This may lead to 3 ways of word segmentation: (1) 李得胜; (2) 李|得胜; (3) 李得|胜, as seen in the following examples.

(17.)    李得胜了

(a)  李    | 得胜 | 了
       Li    | win  | LE
Li won.

(b)   李得   | 胜   | 了
        Li De | win  | LE
Li De won.

(c) *  李得胜          | 了
          Li Desheng | LE

(18.)   李得胜胜了 。

(a) *  李 | 得胜 | 胜  | 了
         Li  | win | win | LE

(b) *  李得   | 胜   | 胜   | 了
          Li De | win  | win  | LE

(c)   李得胜            | 胜   | 了
Li Desheng   | win  | LE
Li Desheng won.

Since a given name like 得胜 is an arbitrary string of 1 or 2 characters, the morphological analysis of the full name should start with the family name, which can optionally combine with any 1 or 2 following characters to form the candidate proper names 李, 李得 and 李得胜. In other words, the family name serves as the left boundary of a full name, and the length is used to determine the candidates. The right segmentation can only be made via sentence analysis, as shown in the above examples.
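The candidate generation just described can be sketched as follows. The one-entry `FAMILY_NAMES` set and the function name `name_candidates` are assumptions for illustration; a real lexicon would register the full closed set of family names.

```python
# Toy set of registered family names (an assumption for illustration only).
FAMILY_NAMES = {"李"}


def name_candidates(text, start, family_names=FAMILY_NAMES):
    """Candidate personal names whose left boundary is a family name at `start`.

    The family name combines with 0, 1 or 2 following characters; no other
    constraint is placed on the given name.
    """
    if start >= len(text) or text[start] not in family_names:
        return []
    candidates = []
    for given_len in (0, 1, 2):
        end = start + 1 + given_len
        if end <= len(text):
            candidates.append(text[start:end])
    return candidates
```

For 李得胜了 this yields the three candidates 李, 李得 and 李得胜. Choosing among them is left to sentence analysis, since the string alone does not decide.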

Most Chinese place names consist of 1 to 3 characters, for example, 武汉市 (Wuhan City), 南陵县 (Nanling County). The arbitrariness of these names makes any sub-string of n characters (0<n<4) in the sentence a suspect. Fortunately, in most cases we can find boundary indicators for these names, like 省 (province), 市 (city), 县 (county), etc. Once the boundary indicator is located, a technique similar to the use of the family name in identifying Chinese personal names can be applied to select candidate place names for verification through grammatical analysis.

In general, there is always a possibility of ambiguity involvement in the formation of all types of proper names.

(19.)     Conclusion for 2.2.

Due to the possible involvement of ambiguity, a parsing grammar for morphological analysis as well as for sentence analysis is required for the proper identification of the words produced by Chinese productive word formation.

  3. W‑CPSG Grammatical Approach

This section presents the W‑CPSG approach to Chinese word identification and morphological analysis. We will demonstrate how a parser based on W‑CPSG solves the problems of word identification ambiguity and productive word formation.

3.1. rationale of W‑CPSG approach

There have been a number of word identification algorithms based on both morphological and syntactic information (see the surveys in Feng 1996 and Sun & Huang 1996). Most such approaches do not use a self-contained grammar to parse the complete sentence. They are confined to the conventional two-step process of the segmentation-before-grammar design. As long as the word identification procedure is independent of a parsing grammar, it is extremely difficult to make full use of grammatical information to resolve ambiguity in word identification. Careful tuning and sophisticated design improve the precision but will not change the theoretical defect of all such approaches. Chen & Liu acknowledge the limitation of their approach due to the lack of a grammar. “However”, they say, “it is almost impossible to apply real world knowledge nor to check the grammatical validity at this stage” (Chen & Liu 1992, p.105). Why impossible at this stage? Because these segmentation systems are based on the two-step architecture, and the grammar is not yet available! As we have demonstrated, the final judgment for proper word identification can hardly be made until the whole sentence is parsed, hence the requirement of a full grammar. A two-step system is therefore forced to compromise: how much grammatical information it involves depends on how much word identification precision it can afford to sacrifice.

Needless to say, there is significant double labor between such a word segmentation procedure and the following stage of parsing. As more and more grammatical information is used to achieve better precision, the overhead of this double labor becomes more serious. We consider the double labor one strong argument against the two-step approach. If enough grammatical information is incorporated, the segmentation procedure is essentially equivalent to a grammar, and the segmenter is equivalent to a parser. Then why two grammars, one for word identification and one for sentence parsing? Why not combine them? That is exactly what we are proposing in W‑CPSG: a one-step approach based on an integrated grammar, eliminating the need for a segmentation preprocessor.

3.2. W‑CPSG character-based parsing

W‑CPSG (Li 1997a, 1997b) is a lexicalized Chinese unification grammar. The work on W‑CPSG is undertaken in the spirit of the modern linguistic theory Head-driven Phrase Structure Grammar (Pollard & Sag 1994). W‑CPSG consists of two parts: a minimized general grammar and an enriched lexicon. The general grammar contains only a handful of PS (phrase structure) rules, covering complement structure, modifier structure, conjunctive structure and morphological structure. This is the nature of lexicalized grammars: PS rules in such grammars are very abstract. Essentially, all they say is that 2 signs can combine so long as the lexicon so indicates. The lexicon houses lexical entries with their linguistic description in feature structures. Potential morphological structures as well as potential syntactic structures are lexically encoded. In syntax, a word expects another sign to form a phrase. In morphology, a morpheme expects another sign to form a word. For example, the prefix 可 (-able) expects a transitive verb to form an adjective. The morphological PS rule builds the morphological structure when a transitive verb does appear after the prefix 可 (-able) in the input string.

We now illustrate how W‑CPSG parses a string of Chinese characters by means of a sample parsing chart. The prototype of W‑CPSG was written in ALE, a grammar compiler developed on top of Prolog by Carpenter & Penn (1994). ALE compiles W‑CPSG into a Chinese parser, a Prolog program ready to accept a string of characters for analysis. A W‑CPSG parse tree embodies both morphological analysis and syntactic analysis, as shown below.

[Figure hpsg12: sample parsing chart (20.)]

 

This is so-called bottom-up parsing. It starts with lexicon look-up. Edges 1 through 7 are lexical edges. Other edges are phrasal edges. Each edge represents a sign, i.e. a character (morpheme), a word, a phrase or a sentence. Lexical edges result from a successful match between the signs in the input string and the entries in the lexicon during lexicon look-up. For example, 可 (-able), 读 (read) and 性 (-ness) are all registered entries in the lexicon, so they get matched and shown by edge 5, edge 6 and edge 7. Words produced by productive word formation present themselves as phrasal edges, e.g. edge ((5+6)+7) for 可读性 (readability). For the sake of concise illustration, we only show two pieces of information for the signs in the chart, namely category and interpretation with a delimiting colon (lexical edges are only labeled for either category or interpretation). The parser attempts to combine the signs according to PS rules in the grammar until parses are found. A parse is an edge which ranges over the whole string. The parse ((((1+2)+3)+4)+((5+6)+7)) for (20.) represents a binary structural tree based on the W‑CPSG analysis, as shown below.

[Figure hpsg13: binary parse tree for (20.)]
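To make the bottom-up process concrete, the following Python sketch builds a chart over the three characters of 可读性. It is a simplification under stated assumptions: atomic category labels and a two-entry rule table stand in for W‑CPSG's feature structures and lexically encoded expectations, the labels `PREFIX_ABLE` and `SUFFIX_NESS` are hypothetical, and the code is not the ALE/Prolog prototype.

```python
from collections import defaultdict

# Toy lexicon: each character (morpheme) maps to its categories.
# The labels PREFIX_ABLE and SUFFIX_NESS are hypothetical names.
LEXICON = {"可": {"PREFIX_ABLE"}, "读": {"Vt"}, "性": {"SUFFIX_NESS"}}

# Toy stand-in for lexical expectations: which two adjacent signs may combine.
RULES = {("PREFIX_ABLE", "Vt"): "A",   # 可 + Vt --> A
         ("A", "SUFFIX_NESS"): "N"}    # A + 性 --> N


def chart_parse(chars):
    """Bottom-up chart: chart[(i, j)] holds the categories spanning chars[i:j]."""
    n = len(chars)
    chart = defaultdict(set)
    # Lexical edges come from lexicon look-up.
    for i, ch in enumerate(chars):
        chart[(i, i + 1)] |= LEXICON.get(ch, set())
    # Phrasal edges combine two adjacent edges licensed by a rule.
    for width in range(2, n + 1):
        for start in range(n - width + 1):
            end = start + width
            for mid in range(start + 1, end):
                for left in chart[(start, mid)]:
                    for right in chart[(mid, end)]:
                        result = RULES.get((left, right))
                        if result:
                            chart[(start, end)].add(result)
    return chart
```

Running `chart_parse("可读性")` places an A edge over 可读 and an N edge over the whole string, mirroring the phrasal edge ((5+6)+7) for 可读性 (readability) in the chart above.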

3.3. ambiguity resolution in word identification

Given the resources of a phrase structure grammar like W‑CPSG, a parser based on standard chart parsing algorithms can handle both cross ambiguity and embedded ambiguity, provided that lexicon lookup uses a match algorithm based on exhaustive lookup instead of maximum match. All candidate words in the input string are presented to the parser for judgment. Ambiguous segmentation becomes a natural part of parsing: different ways of segmentation add different edges, and a successful parse always embodies the right identification. In other words, word identification in our design becomes a by-product of parsing instead of a pre-condition for parsing. The following example of complicated cross ambiguity illustrates how the W‑CPSG parser resolves ambiguity. As seen, both the FMM segmentation (represented by the edge sequence 8-9-5-10) and the BMM segmentation (represented by 1-11-12-10) are in the chart as a result of exhaustive lexicon lookup. They are proved wrong because they do not lead to a successful parse according to the grammar. As a by-product, the final parse (8+(3+(12+10))) automatically embodies the rightly identified word sequence 8-3-12-10, i.e. 出现 (appear) | 在 (at) | 世界 (world) | 东方 (east).

[Figure hpsg10: parsing chart for 出现在世界东方]
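Exhaustive lookup itself is simple to sketch. The Python fragment below collects every lexicon match in the input as a candidate edge and enumerates the word sequences those edges allow. The toy lexicon is an assumption chosen to reproduce the candidate words discussed above, and the function names are hypothetical. The fragment does no parsing: in W‑CPSG it is the grammar that decides which sequence survives.

```python
def exhaustive_lookup(text, lexicon):
    """Every (start, end, word) such that text[start:end] is a lexicon entry."""
    return [(i, j, text[i:j])
            for i in range(len(text))
            for j in range(i + 1, len(text) + 1)
            if text[i:j] in lexicon]


def segmentations(text, lexicon):
    """All word sequences covering the text that the candidate edges allow."""
    edges = exhaustive_lookup(text, lexicon)

    def extend(pos):
        if pos == len(text):
            return [[]]
        paths = []
        for start, end, word in edges:
            if start == pos:
                paths.extend([word] + rest for rest in extend(end))
        return paths

    return extend(0)


# Toy lexicon (an assumption for illustration only).
TOY_LEXICON = {"出", "现", "在", "世", "界", "东", "方",
               "出现", "现在", "在世", "世界", "东方"}
```

For 出现在世界东方 the lookup produces 12 candidate edges. The FMM sequence, the BMM sequence and the correct sequence 出现|在|世界|东方 are all among the segmentations these edges allow, which is why none of them is lost before the grammar has a chance to judge.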

 

Exhaustive lookup also makes an embedded ambiguity sub-string like 烤红薯 no longer a blind area for word identification, as shown in (22.) below. All the candidate words in the sub-string including 烤 (bake), 红薯 (sweet potato), 烤红薯 (baked sweet potato) are added to the chart as lexical edges (edge 4, edge 8 and edge 10). This is a case of genuine ambiguity, resulting in 2 parses corresponding to 2 readings. The first parse (1+(7+10)) identifies the word sequence 他|喜欢|烤红薯, and the second parse (1+(9+(4+8))) a different sequence 他|喜欢|烤|红薯. Edge 7 and edge 9 represent two lexical entries for the verb 喜欢 (like), with different syntactic expectation (categorization). One expects an NP object, notated in the chart by like<NP>, and the other expects a VP complement, notated by like<VP>.

[Figure hpsg11: parsing chart (22.) for 他喜欢烤红薯]

 

We now illustrate how Chinese proper names are identified in W‑CPSG parsing. In the W‑CPSG lexicon, Chinese family name is encoded to optionally expect the given name. Due to the arbitrariness of given names, no other constraint except for the length (either 1 character or 2 characters) is specified in the expectation. Therefore, we have three candidates for proper names in the following example, namely 李 (Li), 李得 (Li De), 李得胜 (Li Desheng), represented respectively by edge 1, edge (1+2) and the NP edge (1+5).[3] The first two candidates contribute to two valid parses while the third does not, hence the identification of the word sequences 李|得胜|了 and 李得|胜|了.

[Figure hpsg8: parsing chart for 李得胜了]

 

Now we add one more character 胜 (win) to form a new sentence, as shown in (24.) below.

[Figure hpsg9: parsing chart (24.) for 李得胜胜了]

 

The first two candidate proper names 李 (Li) and 李得 (Li De) no longer lead to parses. But the third candidate 李得胜 (Li Desheng) becomes part of the parse as a subject NP. The parse (((1+6)+4)+5) corresponds to the identification of the only valid word sequence 李得胜|胜|了.

Finally, we give an example to demonstrate how W‑CPSG handles reduplication in parsing and word identification. The sample sentence to be processed by the parser is 让他分分心 (Let him relax a while), involving the AB-->AAB type verb reduplication for diminutive use.

In most lexicons, 分心 (distract-heart: get distracted) is a registered 2-morpheme verb with internal morphological verb-object relation. Therefore, the reduplication is considered morphological. But in Chinese syntax, we also have a  general verb reduplication rule of the type A-->AA for diminutive use, for example, 看(look) --> 看看(have a look). This morphological verb reduplication rule AB-->AAB and the syntactic verb reduplication rule A-->AA are essentially the same rule in Chinese grammar. 分心 sits in the gray area between morphology and syntax. It looks both like a word (verb) and a phrase (VP). Lexically, it corresponds to one generalized sense (concept) and the internal combination is idiomatic, i.e. 分 (distract) must combine with 心 (heart) to mean 'get distracted'. But, structurally, the combination of 分 and 心 is not fundamentally different from a VP consisting of Vt and NP, as in the phrase 看电影 (see a film). In fact, there is no clear-cut boundary between Chinese morphology and syntax. This morphology-syntax isomorphic fact serves as a further argument to support the W‑CPSG design of integrating morphology and syntax in one grammar module. Although the boundary between Chinese morphology and syntax is fuzzy, hence no universal definition of basic notions like word and phrase, the division can be easily defined system internally in an integrated grammar. In W‑CPSG,  分心 is treated as a phrase (VP) instead of a word (verb). The lexical entry 分 (distract) is coded to obligatorily expect the literal 心 (heart) as its syntactic object, shown in the following chart by the notation V<>. This approach has the advantage of eliminating the doubling of the reduplication rule for diminutive use in both syntax and morphology, making the grammar more elegant. The verb reduplication rule is implemented as a lexical rule in W‑CPSG.[4] This lexical rule creates a reduplicated verb with added diminutive sense, shown by edge 8 (a lexical edge).  
The whole parsing process is illustrated below.

[Figure hpsg7: parsing chart for 让他分分心]

 

 

REFERENCES

Carpenter, B. & Penn, G. (1994): ALE, The Attribute Logic Engine, User's Guide, Carnegie Mellon University

Chen, K-J. & S-H. Liu (1992): "Word identification for Mandarin Chinese sentences", Proceedings of the 15th International Conference on Computational Linguistics, Nantes, 101-107.

Feng, Z-W. (1996): "COLIPS lecture series - Chinese natural language processing",  Communications of COLIPS, Vol.6, No.1 1996, Singapore

Li, W. (1997a): "Outline of an HPSG-style Chinese reversible grammar", Proceedings of The Northwest Linguistics Conference-97 (NWLC-97, forthcoming), UBC, Vancouver, Canada

Li, W. (1997b): W‑CPSG: A Lexicalized Chinese Unification Grammar And Its Application, Doctoral dissertation (on-going), Simon Fraser University, Canada

Liang, N. (1987): "Shumian Hanyu Zidong Fenci Xitong - CDWS" (Automatic word segmentation system for written Chinese - CDWS), Journal of Chinese Information Processing, No.2 1987, pp 44-52, Beijing

Pollard, C.  & I. Sag (1994): Head-Driven Phrase Structure Grammar,  Centre for the Study of Language and Information, Stanford University, CA

Sun, M-S. & C-N. Huang  (1996): "Word segmentation and part of speech tagging for unrestricted Chinese texts" (Tutorial Notes for International Conference on Chinese Computing ICCC'96), Singapore

~~~~~~~~~~~~~~~~~~~

[1] The author benefited from the insightful discussion with Dr. Dekang Lin on the feasibility of parsing Chinese character strings instead of word strings. Thanks also go to Paul McFetridge and Fred Popowich for their supervision and encouragement.

[2] This table is adapted from the following table in Sun & Huang (1996).

case 1 The output of FMM and BMM are different, but both are incorrect 0.054%
case 2 The output of FMM and BMM are different, but only one is correct 9.24%
case 3 The output of FMM and BMM are identical, but incorrect 0.41%
case 4 The output of FMM and BMM are identical, and correct 90.30%

The 4 cases which they listed are not logically exhaustive in terms of sentence based processing (i.e. when discourse is not involved in a system). In particular, there is another case when the output of FMM and BMM are different, and both are correct. We call this a case of genuine cross ambiguity.

[3] Note that there is another S edge (1+5) in the chart. These two edges are structurally different, created via different PS rules. The NP edge (1+5) is formed through the morphological PS rule, combining the family name (edge 1) and its expected given name (edge 5). In the S edge (1+5), however, it is the subject rule (one of the complement PS rules) that decides the combination of the predicate (edge 5) and its expected subject NP (edge 1).

[4] Lexical rules are favored by many linguists to capture redundancy in the lexicon instead of the conventional approach of syntactic transformation. Lexical rules are applied at compile time to form an expanded lexicon before parsing starts.

 


Interaction of syntax and semantics in parsing Chinese transitive verb patterns *
(old paper in Proceedings of International Chinese Computing Conference, ICCC'96)

Wei  LI

Department of Linguistics, Simon Fraser University
Burnaby, B.C. V5A 1S6 CANADA

Keywords: Chinese processing, transitive pattern, syntax, semantics, lexical rule, HPSG

Abstract

This paper addresses the problem of parsing Chinese transitive verb patterns (including the BA construction and the BEI construction) and handling the related phenomena of semantic deviation (i.e. the violation of the semantic constraint).

We designed a syntax-semantics combined model of Chinese grammar in the framework of Head-driven Phrase Structure Grammar [Pollard & Sag 1994]. Lexical rules are formulated to handle both the transitive patterns which allow for semantic deviation and the patterns which disallow it. The lexical rules ensure the effective interaction between the syntactic constraint and the semantic constraint in analysis.

The contribution of our research can be summarized as:

(1) the insight on the interaction of syntax and semantics in analysis;
(2) a proposed lexical rule approach to semantic deviation based on (1);
(3) the application of (2) to the study of the Chinese transitive patterns;
(4) the implementation of (3) in a unification-based Chinese HPSG prototype.

  1. Background

When Chomsky proposed his Syntactic Structures in the 1950s, he seemed to indicate that syntax should be addressed independently of semantics. As a convincing example, he presented a famous sentence:

1)             Colorless green ideas sleep furiously.

Weird as it sounds, the grammaticality of this sentence is intuitively acknowledged: (1) it follows the English syntax; (2) it can be interpreted. In fact, there is only one possible interpretation, solely decided by its syntactic structure. In other words, without the semantic interference, our linguistic knowledge about the English syntax is sufficient to assign roles to each constituent to produce a reading although the reading does not seem to make sense.

However, things are not always this simple. Compare the following Chinese sentences of the same form NP NP V:

2a)           dianxin  wo           chi           le.
                Dim-Sum I               eat           LE.
The Dim Sum I have eaten.
Note:        LE is a particle for perfect aspect.

2b)   wo dianxin chi le.
I have eaten the Dim Sum.

Who eats what? There is no formal way but to resort to the semantic constraint imposed by the notion eat to reach the correct interpretation [Li, W. & McFetridge 1995].

Of course, if we want to maintain the purity of syntax, it could be argued that syntax will only render possible interpretations and not the interpretation.  It is up to other components (semantic filter and/or other filters) of grammar to decide which interpretation holds in a certain context or discourse. The power of syntax lies in the ability to identify structural ambiguities and to render possible corresponding interpretations. We call this type of linguistic design a syntax-before-semantics model. While this is one way to organize a  grammar, we found it unsatisfactory for two reasons. First, it does not seem to simulate the linguistic process of human comprehension closely.  For human listeners, there are no ambiguities involved in sentences 2a) and 2b). Secondly, there is considerable cost on processing efficiency in terms of computer implementation. This efficiency problem can be very serious in the analysis of languages like Chinese with virtually no inflection.

Head-driven Phrase Structure Grammar (HPSG) [Pollard & Sag 1994, 1987] assumes a lexicalist approach to linguistic analysis and advocates an integrated model of syntax and the other components of grammar. It serves as a desirable framework for integrating the semantic constraint in establishing syntactic structures and interpretations. Therefore, we proposed to enforce the semantic constraint that an animate being eats food directly in the lexical entry chi (eat) [Li, W. & McFetridge 1995]: chi (eat) requires an animate NP subject and a food NP object. This correctly addresses the who-eats-what problem for sentences like 2a) and 2b). In fact, this type of semantic constraint (selection restriction) has been widely used for disambiguation in NLP systems.
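A rigidly enforced selection restriction of this kind can be sketched in a few lines of Python. The semantic type labels, the dictionary layout and the function name `assign_roles` are assumptions made for illustration; they are not the feature structures of the actual HPSG lexical entry.

```python
# Toy semantic types for the NPs (assumptions for illustration only).
SEM_TYPE = {"wo": "animate", "dianxin": "food"}

# Toy frame for chi (eat): an animate subject and a food object.
CHI_FRAME = {"subject": "animate", "object": "food"}


def assign_roles(np1, np2, frame=CHI_FRAME, sem_type=SEM_TYPE):
    """Role assignments for the form NP NP V that satisfy the verb's constraint.

    Both orders are tried; the surface order itself decides nothing.
    """
    readings = []
    for subj, obj in ((np1, np2), (np2, np1)):
        if (sem_type.get(subj) == frame["subject"]
                and sem_type.get(obj) == frame["object"]):
            readings.append({"subject": subj, "object": obj})
    return readings
```

Both `assign_roles("dianxin", "wo")` for 2a) and `assign_roles("wo", "dianxin")` for 2b) return the single reading in which wo is the eater and dianxin is what is eaten.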

The problem is, the constraint should not always be enforced. In the practice of communication, deviation from the constraint is common and deviation is often deliberately applied to help render rhetorical expressions.

 

3) xiang      chi           yueliang,  ni             gou           de3    zhao       me?
    want        eat           moon,       you          reach       DE3  -able          ME?
Wanting to eat the moon, but can you reach it?
Note:  DE3 is a particle, introducing a postverbal adjunct of result or capability. ME is a sentence final particle for yes-no question.

4) dajia         dou   chi           shehui zhuyi,           neng         bu            qiong       me?
     people      all      eat           social -ism,               can            not           poor         ME
Everyone is eating socialism, can it not be poor?

yueliang (moon) is not food, of course. It is still some physical object, though. But in 4), shehui zhuyi (socialism) is a purely abstract notion. If a parser enforces the rigid semantic constraint, there are many such sentences that will be rejected without getting a chance to be interpreted. The fact is, we do have interpretations for 3) and 4). Hence an adequate grammar should be able to accommodate those interpretations.

To capture such deviation, Wilks came up with his Preference Semantics [Wilks 1975, 1978]. A sophisticated mechanism is designed to calculate the semantic weight for each possible interpretation, i.e. how much it deviates from the preference semantic constraint. The final choice is given to the interpretation with the most semantic weight in total. His preference model simulates the process of how humans comprehend language more closely than most previous approaches.

The problem with this design is the serious computational complexities involved in the model [Huang 1987]. In order to calculate the semantic weight, the preference semantic constraint is loosened step by step. Each possible substructure has to be re-tried with each step of loosening. It may well lead to combinatorial explosion.

What we are proposing here is to look at semantic deviation in the light of the interaction of the syntactic constraint and the semantic constraint. In concrete terms, the loosening of the semantic constraint is conditioned by syntactic patterns. Syntactic pattern is defined as the representation of an argument structure in surface form. A pattern consists of 2 parts: a structure's syntactic constraint (in terms of the syntactic categories and configuration, word order,  function words and/or inflections) and its interpretation (role assignment). For example, for Chinese transitive structure, NP V NP: SVO is one pattern, NP NP V: SOV is another pattern, and NP [ba NP] V: SOV (the BA construction) is still another. The expressive power of a language is indicated by the variety of patterns used in that language. Our design will account for some semantic deviation or rhetorical phenomena seen in everyday Chinese without the overhead of computational complexities. We will focus on Chinese transitive verb patterns for illustration of this approach.

  2. Chinese transitive patterns

Assuming three notional signs wo (I), chi (eat) and dianxin (Dim Sum), there are maximally 6 possible combinations in surface word order, out of which 3 are grammatical in Chinese.[1]

5a)           wo chi le dianxin.                                   SVO
5b)           wo dianxin chi le.                                   SOV
5c)           dianxin wo chi le.                                    OSV
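The six orders can be enumerated mechanically. The following is only an illustrative Python sketch, not part of the grammar described in this paper; the names SIGNS, GRAMMATICAL and word_orders are assumptions introduced here, and the grammaticality marks are simply copied from 5a)-5c) and footnote [1].

```python
from itertools import permutations

# Role -> surface form; the aspect particle LE is kept with the verb.
SIGNS = {"S": "wo", "V": "chi le", "O": "dianxin"}

# Role orders that the text marks as grammatical transitive readings (5a-5c).
GRAMMATICAL = {"SVO", "SOV", "OSV"}


def word_orders():
    """Return (order, sentence, is_grammatical) for all 6 permutations."""
    rows = []
    for perm in permutations("SVO"):
        order = "".join(perm)
        sentence = " ".join(SIGNS[role] for role in perm)
        rows.append((order, sentence, order in GRAMMATICAL))
    return rows


if __name__ == "__main__":
    for order, sentence, ok in word_orders():
        # A leading '*' marks the orders rejected under that role assignment.
        print(f"{'' if ok else '* '}{sentence:<20}{order}")
```

Note that a starred order may still be acceptable under a different pattern, as footnote [1] explains for 5d2), 5e2) and 5f2).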

SVO is the canonical word order for Chinese transitive structure. When a string of signs matches the order NP V NP, the semantic constraint has to yield to syntax for interpretation.

NP V NP: SVO

6)  daodi         shi     ni             zai         du       shu          ne,
haishi                 shu           zai         du       ni             ne?

     on-earth     be     you          ZAI        read     book        NE,
or                        book        ZAI        read     you          NE?

Are you reading the book, or is the book reading you, anyway?
Note:        ZAI is a particle for continuous aspect.
NE is a sentence final particle for or-question.

As in the English equivalent, the interpretation of 6) can only be SVO, no matter how contradictory it might be to our common sense. In other words, in the form NP V NP, syntax plays a decisive role.

In contrast, to interpret the form NP NP V as SOV in 2b), the semantic constraint is critical. Without the enforcement of the semantic constraint, the SOV interpretation does not hold. In fact, this SOV pattern (NP1 NP2 V: SOV) has been regarded as ungrammatical in a Case Theory account of the Chinese transitive structure in the framework of GB. According to that analysis, something similar to this pattern constitutes the D‑Structure for the transitive pattern, and Chinese is an underlying SOV language (the "SOV Hypothesis": see the survey in Gao 1993). In the surface structure, NP2 is without case on the assumption that V assigns its CASE only to the right. One has to either insert the case-marker ba to assign CASE to it (the BA construction) or move it to the right of V to get its CASE (the SVO pattern). This analysis suffers from not being able to account for the grammaticality of sentences like 2b). However, by distinguishing the deep pattern SOV from the two surface patterns (SVO and the BA construction), the theory has the merit of alerting us that the SOV pattern seems to be syntactically problematic (crippled, so to speak). This is an insightful point, but it goes one step too far in totally rejecting the SOV pattern in surface structure. If we modify this idea, we can claim that SOV is a syntactically unstable pattern and that SOV tends to (but need not) "transform" into the SVO or the BA construction unless it is reinforced by semantic coherence (i.e. the enforcement of the semantic constraint). This argument in the light of syntax-semantics interaction is better supported by the Chinese data. In essence, our account is close to this reformulated argument, but in our theory, we do not assume a deep structure and transformation. All patterns are surface constructions. If no sentence can match a construction, it is not considered a pattern by our definition.

This type of unstable pattern, which depends on the semantic constraint, is not limited to transitive phenomena. For example, the type of Chinese NP predicate defined in [Li, W. & McFetridge 1995] is also a semantics-dependent pattern. Compare:

7a)  zhe           zhang       zhuozi                  san          tiao          tui.
        this           Cl.         table(furniture)      three        Cl.            leg
This table is three-legged.
Note:        Cl for classifier.

7b) *        zhe           zhang       ditu                          san          tiao          tui.
                this           Cl.           map(non-furniture)  three        Cl.            leg

There is clearly a semantic constraint of the NP predicate on its subject: it should be furniture (or animate). Without this "semantic agreement", Chinese NP is normally not capable of functioning as a predicate, as shown in 7b).

Between semantics dependent and semantics independent patterns, we may have partially dependent patterns. For example, in NP NP V: OSV, it seems that the semantic constraint on the initial object is less important than the semantic constraint on the subject.

8)   shitou                wo              ye   xiang  chi,    kexi      yao       bu      dong.
   stone(non-food)  I(animate) also want  eat,    pity       chew    not      -able

Even stones I also want to eat, but it's such a pity that I am not able to chew them.

If the constraint on the object matches well, is the subject allowed to be semantically deviant?

9) ?          dianxin                     zhuozi                        chi           le.
                Dim-Sum(food)        table(non-animate)  eat           LE.

These are marginal cases: a grammar may choose to be tolerant and accept them, or to be restrictive and reject them.

Unlike SOV, but similar to its English counterpart, OSV is one type of Chinese topic construction, and the relationship between the initial O and V is a long-distance dependency.

10a)  dianxin      wo     xiangxin   ni           yiwei        Lisi          chi           le.
          Dim-Sum    I         believe     you          think        Lisi           eat           LE

The Dim Sum I believe you think that Lisi ate.

10b) *      Lisi wo xiangxin ni yiwei dianxin chi le.

10b) will not be accepted in our model because (1) it cannot be interpreted as OSV since it violates the semantic constraint on S: dianxin is not animate; (2) nor can it be interpreted as SOV since it violates the configurational constraint: SOV is simply not a long-distance pattern. In fact, NP NP V: SOV is such a restricted pattern in Chinese that it not only excludes any long-distance dependency but even disallows some adjuncts. Compare 11a) in the OSV pattern with 11b) and 11c) in the SOV pattern:

11a)  dianxin      wo           jinjinyouwei             de2           chi           le.
          Dim-Sum      I              with-relish                DE2         eat           LE

The Dim Sum I ate with relish.
Note:        DE2 is a particle introducing a preverbal adjunct of  manner.

11b) *      wo dianxin jinjinyouwei de2 chi le.

11c) *      wo jinjinyouwei de2 dianxin chi le.

There is another pattern with the linear order SOV, the notorious Chinese BA construction. ba is usually regarded as a preposition which introduces a preverbal object for transitive verbs.

NP [ba NP] V: SOV

12a)  wo           ba            dianxin       jinjinyouwei             de2          chi           le.
           I              BA           Dim-Sum     with-relish                DE2         eat           LE

I ate the Dim Sum with relish.

12b)         wo jinjinyouwei de2 ba dianxin  chi le.
With relish, I ate the Dim Sum.

12c)         dianxin  ba wo jinjinyouwei de2  chi le.
The Dim Sum ate me with relish.

12d)         dianxin jinjinyouwei de2 ba wo  chi le.
With relish, the Dim Sum ate me.

For the OSV order, there is another so-called BEI construction. The BEI construction is usually regarded as an explicit passive pattern in Chinese.

NP [bei NP] V: OSV

13a)        dianxin       bei          wo           chi           le.
                Dim-Sum     BEI          I               eat           LE

The Dim Sum was eaten by me.

13b)         wo bei dianxin  chi le.

I was eaten by the Dim Sum.

The BEI construction and the BA construction are both semantics independent. In fact, any pattern resorting to the means of function words in Chinese seems to be sufficiently independent of the semantic constraint.

To conclude, semantic deviation often occurs in some more independent patterns, as seen in 5d2), 6), 8), 12c), 12d), 13b). Close study reveals that different patterns result in different reliance on the semantic constraint, as summarized in the following table.

                syntactic pattern             semantic dependence

                NP V NP: SVO                  no dependence
                NP [ba NP] V: SOV             no dependence
                NP [bei NP] V: OSV            no dependence
                NP NP V: OSV                  partial dependence
                NP NP V: SOV                  full dependence
............

It should be emphasized that this observation constitutes the rationale behind our approach.
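The table can be read as a small lookup from pattern to the roles whose semantic constraint must hold. The sketch below is an assumption-laden illustration in Python, not the ALE implementation discussed later: ENFORCED, dependence and accepts are hypothetical names, and "partial dependence" is taken to mean that only the subject is checked, following the tentative observation about OSV above.

```python
# Roles whose semantic constraint is enforced in each surface pattern.
ENFORCED = {
    "NP V NP: SVO": set(),
    "NP [ba NP] V: SOV": set(),
    "NP [bei NP] V: OSV": set(),
    "NP NP V: OSV": {"S"},          # assumption: only the subject is checked
    "NP NP V: SOV": {"S", "O"},
}


def dependence(pattern):
    """Name the degree of semantic dependence, as in the table."""
    labels = ["no dependence", "partial dependence", "full dependence"]
    return labels[len(ENFORCED[pattern])]


def accepts(pattern, subject_ok, object_ok):
    """True if every enforced constraint is satisfied for this pattern."""
    satisfied = {"S": subject_ok, "O": object_ok}
    return all(satisfied[role] for role in ENFORCED[pattern])


if __name__ == "__main__":
    # 6) "the book is reading you": both roles deviant, still SVO.
    print(accepts("NP V NP: SVO", False, False))
    # 8) "stones I want to eat": deviant object, well-formed subject.
    print(accepts("NP NP V: OSV", True, False))
    # The same deviant object is not tolerated under SOV.
    print(accepts("NP NP V: SOV", True, False))
```

Under this strict reading the marginal case 9) is rejected; a more tolerant grammar could empty the set for the OSV pattern.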

  3. Formulation of lexical rules

Based on the above observation, we have designed a syntax-semantics combined model. In this model, we take a lexical rule approach to Chinese patterns and the related problem of semantic deviation.

A lexical rule takes as its input a lexical entry which satisfies its condition and generates another entry. Lexical rules are usually used to cover lexical redundancy between related patterns. The design of lexical rules is preferred by many grammarians over the more conventional use of syntactic transformation, especially for lexicalist theories.

Our general design is as follows, still using chi (eat) for illustration:

(1)   Syntactically, chi (eat) as a transitive verb subcategorizes for a left NP as its subject and a right NP as its object.

(2)   Semantically, the corresponding notion eat expects an entity of category animate as its logical subject and an entity of category food as its logical object. Therefore the common sense (knowledge) that animate being eats food is represented.

(3)   The interaction of syntax and semantics is implemented by lexical rules. The lexical rules embody the linguistic generalizations about the transitive patterns. They will decide to enforce or waive the semantic constraint based on different patterns.

As seen, syntax only stipulates the requirement of two NPs as complements for chi and does not care about the NPs' semantic constraint. Semantics sets its own expectation of an animate entity and a food entity as arguments for eat and does not care what syntactic forms these entities assume on the surface. It is up to lexical rules to coordinate the two. In our model, the information in (1) and (2) is encoded in the corresponding lexical entry, and the lexical rules in (3) will then be applied to expand the lexicon before parsing begins. Driven by the expanded lexicon, analysis is implemented by a lexicalist parser to build the interpretation structure for the input sentence. Following this design, there will be sufficient interaction between syntax and semantics as desired, while syntax still remains a component self-contained from semantics in the lexicon. More importantly, this design does not add any computational complexity to parsing because, in order to handle different patterns, similar lexical rules are required even for a pure syntax model.
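Points (1)-(3) can be pictured with a small data structure. This is a hedged Python sketch under stated assumptions, not the typed feature structures of the actual grammar: VerbEntry, apply_rule and the dictionary keys are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerbEntry:
    word: str        # surface form, e.g. "chi"
    notion: str      # corresponding concept, e.g. "eat"
    subcat: tuple    # (1) syntactic side: categories of the two complements
    expects: tuple   # (2) semantic side: (subject, object) categories


def apply_rule(entry, pattern, enforce_subject=False, enforce_object=False):
    """(3) A lexical rule: pair an entry with one surface pattern and say
    which semantic expectations are enforced (None means waived)."""
    subject_constr, object_constr = entry.expects
    return {
        "word": entry.word,
        "pattern": pattern,
        "subject_constraint": subject_constr if enforce_subject else None,
        "object_constraint": object_constr if enforce_object else None,
    }


CHI = VerbEntry("chi", "eat", ("NP", "NP"), ("animate", "food"))

# SVO waives the semantic constraint; bare SOV enforces it on both roles.
svo = apply_rule(CHI, "NP1 V NP2: SVO")
sov = apply_rule(CHI, "NP1 NP2 V: SOV",
                 enforce_subject=True, enforce_object=True)

if __name__ == "__main__":
    print(svo)
    print(sov)
```

The entry itself never mixes the two kinds of information; only the derived pattern decides whether the semantic side is consulted.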

Before we proceed to formulate lexical rules for transitive patterns, we should make clear what a transitive pattern is. As we defined before, a pattern consists of two parts: a structure's syntactic constraint and the corresponding interpretation. Word order is an important constraint in Chinese syntax. In addition to word order, we have categories and function words (preposition, particle, etc.). As for interpretation, the transitive structure involves 3 elements: V (predicate) and its arguments S (logical subject) and O (logical object). There is a further factor to take into account: Chinese complements are often optional. In many cases, subject and/or object can be omitted either because they can be recovered from the discourse or because they are unknown. We call those patterns elliptical patterns (with some complement(s) omitted), in contrast to full patterns. With these in mind, we can define 10 patterns for the Chinese transitive structure: 5 full patterns and 5 elliptical patterns.

We now investigate these transitive patterns one by one and try to informally formulate the corresponding lexical rules to capture them. Please note that the basic input condition is the same for all the lexical rules. This is because they share the same argument structure: the transitive structure.

Lexical rule 1:   

                V ((NP1, NP2), (constr1, constr2)) --> NP1 V NP2: SVO

The above notation for the lexical rule should be largely self-explanatory. The input of the rule is a transitive verb which subcategorizes for two NPs, NP1 and NP2, and whose corresponding notion expects two arguments of constr1 and constr2. NP is a syntactic category, and constr is a semantic category (human, animate, food, etc.). The output pattern is in the defined word order SVO and waives the semantic constraint.

Lexical rule 2:   

      V ((NP1, NP2), (constr1, constr2)) --> [NP1, constr1] [NP2, constr2] V: SOV

Please note that the semantic constraint is enforced for this SOV pattern. Since this pattern shares the form NP NP V with the OSV pattern, it would be interesting to see what happens if a transitive verb has the same semantic constraint on both its subject and object. For example, qingjiao (consult) expects a human subject and a human object.

14)           ta                     ni                               qingjiao    guo        me?
                he(human)     you(human)             consult     GUO        ME

Him, have you ever consulted?
Note: GUO is a particle for experience aspect.

15)           ni ta  qingjiao guo  me?

You, has he ever consulted?

In both cases, the interpretation is OSV instead of SOV. Therefore, we need to reformulate Lexical rule 2 to exclude the case when the subject constraint is the same as the object constraint.

Lexical rule 2' (refined version):

                V ((NP1, NP2), (constr1, constr2), (constr1 not = constr2))

                --> [NP1, constr1] [NP2, constr2] V: SOV
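To see how the refined rule separates the two readings of the form NP NP V, here is a minimal Python sketch. It is an illustration under assumptions: the toy tables NOUNS and VERBS and the function readings are hypothetical, each noun simply lists the semantic categories it satisfies, and the OSV branch anticipates Lexical rule 4 below, which checks only the subject.

```python
# Toy lexicon: semantic categories each noun satisfies (assumed for the demo).
NOUNS = {
    "wo": {"human", "animate"},
    "ni": {"human", "animate"},
    "ta": {"human", "animate"},
    "dianxin": {"food"},
    "shitou": set(),            # stone: neither animate nor food
    "zhuozi": {"furniture"},
}

# Verb -> (constr1 on the logical subject, constr2 on the logical object).
VERBS = {"chi": ("animate", "food"), "qingjiao": ("human", "human")}


def readings(first_np, second_np, verb):
    """Readings licensed for the surface string 'first_np second_np verb'."""
    constr1, constr2 = VERBS[verb]
    found = []
    # Lexical rule 2': SOV, both constraints enforced, and blocked
    # altogether when the two constraints are identical.
    if (constr1 != constr2
            and constr1 in NOUNS[first_np]
            and constr2 in NOUNS[second_np]):
        found.append("SOV")
    # Lexical rule 4: OSV, only the subject (the second NP) is checked.
    if constr1 in NOUNS[second_np]:
        found.append("OSV")
    return found


if __name__ == "__main__":
    for args in [("wo", "dianxin", "chi"), ("dianxin", "wo", "chi"),
                 ("ta", "ni", "qingjiao"), ("shitou", "wo", "chi"),
                 ("dianxin", "zhuozi", "chi")]:
        print(" ".join(args), "->", readings(*args))
```

With these toy entries, 14) and 15) receive only the OSV reading, as the text requires, and the marginal 9) receives none.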

Lexical rule 3:

                V ((NP1, NP2), (constr1, constr2)) --> NP1 [ba NP2] V: SOV

This is the typical BA construction. But not every transitive verb can assume the BA pattern. In fact, ba is one of a set of prepositions to introduce the logical object. There are other more idiosyncratic prepositions (xiang, dao, dui, etc.) required by different verbs to do the same job.

16a)      ni             qingjiao    guo         ta             me?
              you          consult     GUO        he            ME

Have you ever consulted him?

16b)         ni             xiang        ta             qingjiao    guo        me?
                 you          XIANG     he            consult     GUO        ME

Have you ever consulted him?

16c) *      ni             ba            ta             qingjiao    guo        me?
                you          BA           he            consult     GUO        ME

17a)         ta             qu             guo         Beijing.
                 he            go-to        GUO        Beijing

He has been to Beijing.

17b)         ta             dao         Beijing     qu             guo.
                 he            DAO        Beijing     go-to        GUO

He has been to Beijing.

17c) *      ta             ba            Beijing     qu            guo.
                 he            BA           Beijing     go-to        GUO

18a)         ta             hen         titie                             zhangfu.
                 she           very       tenderly-care-for      husband

She cares for her husband very tenderly.

18b)         ta             dui          zhangfu       hen        titie.
                 she           DUI         husband      very       tenderly-care-for

She cares for her husband very tenderly.

18c) *      ta             ba            zhangfu         hen                          titie.
                she           BA           husband         very                         tenderly-care-for

This originates from the different theta-roles assumed by different verb notions for their object argument: patient, theme, destination, to name only a few. These theta-roles are a further classification of the more general semantic role of logical object. We can rely on the subcategorization property of the verb for the choice of the preposition literal (the so-called valency preposition). With the valency information in place, we now reformulate Lexical rule 3 to make it more general:

Lexical rule 3' (refined version):

       V ((NP1, NP2), (constr1, constr2),  (valency_preposition=P), (P not = null))

       --> NP1 [P NP2] V: SOV
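The refined rule amounts to a per-verb lookup of the preposition literal. The Python sketch below is illustrative only: VALENCY_PREPOSITION, rule_3_prime and licenses are hypothetical names, the verb-preposition pairs are read off examples 12), 16), 17) and 18), and "hypothetical_verb" is an invented placeholder for a verb with no valency preposition.

```python
# Valency preposition per verb, as suggested by examples 12), 16)-18).
VALENCY_PREPOSITION = {
    "chi": "ba",
    "qingjiao": "xiang",
    "qu": "dao",
    "titie": "dui",
    "hypothetical_verb": None,   # placeholder: no valency preposition
}


def rule_3_prime(verb):
    """Return the pattern NP1 [P NP2] V as a token list, or None when the
    rule does not apply (P is null)."""
    preposition = VALENCY_PREPOSITION.get(verb)
    if preposition is None:
        return None
    return ["NP1", preposition, "NP2", verb]


def licenses(verb, preposition):
    """True if this preposition may introduce the verb's logical object."""
    expected = VALENCY_PREPOSITION.get(verb)
    return expected is not None and expected == preposition


if __name__ == "__main__":
    print(rule_3_prime("qingjiao"))        # pattern behind 16b)
    print(licenses("qingjiao", "ba"))      # 16c) is starred
    print(rule_3_prime("hypothetical_verb"))
```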

Lexical rule 4:   

                V ((NP1, NP2), (constr1, constr2)) --> NP2 ... [NP1, constr1] V: OSV

This is a topic pattern with a long-distance dependency. It is up to different formalisms to provide different approaches to long-distance phenomena. In our present implementation, NP2 is placed in a feature called BIND to indicate the nature of the long-distance dependency. A phrase structure rule, the Topic Rule, is designed to use this information and handle the unification of the long-distance complement properly.

Following the topic pattern, the passive BEI construction is formulated in Lexical rule 5.

Lexical rule 5:   

                V ((NP1, NP2), (constr1, constr2)) --> NP2 [bei NP1] V: OSV

We now turn to elliptical patterns.

Lexical rule 6:   

                V ((NP1, NP2), (constr1, constr2)) --> V NP2: VO

19)           chi           guo          jiaozi                        me?
                eat           GUO        dumpling                 ME

Have (you) ever eaten dumpling?

Lexical rule 7:   

                V ((NP1, NP2), (constr1, constr2)) --> [NP1, constr1] V: SV

20)           wo           chi           le.
                I               eat           LE

I have eaten (it).

21)           ji                                 chi           le.
                chicken1(animate)   eat           LE

The chicken has eaten (it).

Like its English counterpart, ji (chicken) has two senses: (1) chicken1 as animate; (2) chicken2 as food. We code this difference in two lexical entries. Only the first entry matches the semantic constraint on the subject in the pattern and reaches the above SV interpretation in 21). Interestingly enough, the same sentence will get another parse with a different interpretation OV in 23) because the second entry also satisfies the semantic constraint on the object in the OV pattern in Lexical rule 8.

22)           ni             qingjiao    guo         me?
                you          consult     GUO        ME

Have you consulted (someone)?

22) indicates that the SV interpretation is preferred over the OV interpretation when the semantic constraint on the subject and the semantic constraint on the object happen to be the same. Hence the added condition in Lexical rule 8.

Lexical rule 8:   

                V ((NP1, NP2), (constr1, constr2), (constr1 not = constr2))

                --> [NP2, constr2] V: OV

23)           ji                                 chi           le.
                chicken2(food)         eat           LE

The chicken has been eaten.
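The two parses of ji chi le in 21) and 23), and the single parse of 22), follow from Lexical rules 7 and 8 applied to sense-specific entries. A minimal Python sketch, assuming hypothetical toy tables SENSES and VERBS in which every sense lists the semantic categories it satisfies:

```python
# Noun -> list of (sense label, categories satisfied); toy data for the demo.
SENSES = {
    "ji": [("chicken1", {"animate"}), ("chicken2", {"food"})],
    "wo": [("I", {"human", "animate"})],
    "ni": [("you", {"human", "animate"})],
}

# Verb -> (constr1 on the logical subject, constr2 on the logical object).
VERBS = {"chi": ("animate", "food"), "qingjiao": ("human", "human")}


def parse_np_v(noun, verb):
    """All (interpretation, sense) pairs for the elliptical form NP V."""
    constr1, constr2 = VERBS[verb]
    parses = []
    for sense, categories in SENSES[noun]:
        # Lexical rule 7: SV, subject constraint enforced.
        if constr1 in categories:
            parses.append(("SV", sense))
        # Lexical rule 8: OV, object constraint enforced, and only
        # available when the two constraints differ.
        if constr1 != constr2 and constr2 in categories:
            parses.append(("OV", sense))
    return parses


if __name__ == "__main__":
    print(parse_np_v("ji", "chi"))        # two parses: 21) and 23)
    print(parse_np_v("ni", "qingjiao"))   # SV only, as in 22)
```

The added condition in rule 8 is what keeps 22) from also being read as "(someone) has consulted you".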

Lexical rule 9:   

                V ((NP1, NP2), (constr1, constr2)) --> NP2 [bei V]: OV

24)           dianxin    bei           chi           le.
                Dim-Sum  BEI          eat           LE

The Dim Sum has been eaten.

Lexical rule 10:

                V ((NP1, NP2), (constr1, constr2)) --> V: V

25)           chi           le             me?
                eat           LE            ME?                        

(Have you) eaten (it)?

  4. Implementation

We begin with a discussion of some major feature structures in HPSG related to handling the transitive patterns.  Then, we will show how our proposal works and discuss some related implementation issues.

HPSG is a highly lexicalist theory. Most information is housed in the lexicon. The general grammar is kept to a minimum: only a few phrase structure rules (called ID Schemata) associated with a couple of principles. The data structure is the typed feature structure. The necessary part of a typed feature structure is the type information. A simple feature structure contains only the type information, but a complex feature structure can introduce a set of feature/value pairs in addition to the type information. In a feature/value pair, the value is itself a feature structure (simple or complex). The following is a sample implementation of the lexical entry chi for our Chinese HPSG grammar using the ALE formalism [Carpenter & Penn 1994].

[Figure hpsg3: feature structure of the lexical entry chi (eat); image not reproduced here]

Note:  (1) Uppercase notation for feature; (2) Lowercase notation for type; (3) Number indices in square brackets for unification.

Leaving the notational details aside, what this roughly says is: (1) for the semantic constraint, the arguments of the notion eat are an animate entity and a food entity; (2) for the syntactic constraint, the complements of the verb chi are 2 NPs: one on the left and the other on the right; (3) the interpretation of the structure is a transitive predicate with a subject and an object. The three corresponding features are: (1) KNOWLEDGE; (2) SUBCAT; (3) CONTENT. KNOWLEDGE stores some of our common sense by capturing the internal relation between concepts. Such common sense knowledge is represented in linguistic ways, i.e. it is represented as a semantic expectation feature, which parallels the syntactic expectation feature SUBCAT. KNOWLEDGE defines the semantic constraint on the expected arguments no matter what syntactic forms the arguments will take. In contrast, SUBCAT only defines the syntactic constraint on the expected complements. The syntactic constraint includes word order (LEFT feature), syntactic category (CATEGORY feature) and configurational information (LEX feature). Finally, the CONTENT feature assigns the roles SUBJECT and OBJECT for the represented structure.

A more important issue is the interaction of the three feature structures. Among the three features, only KNOWLEDGE is our add-on. The relationship between SUBCAT and CONTENT has been established in all HPSG versions: SUBCAT resorts to CONTENT for interpretation.  This interaction corresponds to our definition of pattern. Everything goes fine as far as the syntactic constraint alone can decide interpretation. When the semantic constraint (in KNOWLEDGE) has to be involved in the interpretation process, we need a way to access this information. In unification based theories, information flow is realized by unification (i.e. structure sharing, which is represented by the co-index of feature values). In general, we have two ways to ensure structure sharing in the lexicon. It is either directly co-indexed in the lexical entries, or it resorts to lexical rules. The former is unconditional, and the latter is conditional. As argued before, we cannot directly enforce the semantic constraint for every transitive pattern in Chinese, for otherwise our grammar will not allow for any semantic deviation. We are left with lexical rules which we have informally formulated in Section 3 and implemented in the ALE formalism.

CATEGORY is another major feature for a sign. The CATEGORY feature in our implementation includes functional categories, which can specify a functional literal (function word) as their value. Function words belong to closed categories. Therefore, they can be classified by enumeration of literals. Like word order, function words are an important form of syntactic constraint in Chinese. Grammars for other languages also resort to some functional literals for constraint. In most HPSG grammars for English, for example, a preposition literal is specified in a feature called P_FORM. There are two problems involved there. First, at the representation level, there is redundancy: P_FORM:x --> CATEGORY:p (where x is not null). In other words, there exists a feature dependency between P_FORM and CATEGORY which is not captured in the formalism. Second, if P_FORM is designed to stipulate a preposition literal, we will ultimately need to add features like CL_FORM for classifier specification, CO_FORM for conjunction specification, etc. In fact, for each functional category, literal specification may be required for constraint in a non-toy grammar. That will make the feature system of the grammar too cumbersome. These problems are solved in our grammar implementation in ALE. One significant mechanism in ALE is its type inheritance and appropriateness specifications for feature structures [Carpenter & Penn 1994]. (A similar design is found in the new software paradigm of Object Oriented Programming.) Thanks to ALE, we can now use literals (ba, xiang, dao, dui, etc.) as well as major categories (n, v, a, p, etc.) to define the CATEGORY feature. In fact, any intermediate level of subclassification between these two extremes, major categories and literals, can be represented in CATEGORY just as handily. Together they constitute a type hierarchy of CATEGORY. The same mechanism can also be applied to semantic categories (human, animate, food, etc.) to capture thesaurus inferences like human --> animate. This makes our knowledge representation much more powerful than in those formalisms without this mechanism. We will address this issue in depth in another paper, Typology for syntactic category and semantic category in Chinese grammar.
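The idea of one type hierarchy covering both literals and major categories, and likewise semantic categories, can be imitated with a parent table and a subsumption test. This is a plain Python analogy, not ALE syntax or ALE's actual type system; PARENT and subsumes are hypothetical names, and the root types "category" and "entity" are assumptions added to make the example complete.

```python
# Child type -> parent type. Literals sit below their major category;
# semantic categories form a small thesaurus.
PARENT = {
    "ba": "p", "xiang": "p", "dao": "p", "dui": "p",
    "p": "category", "n": "category", "v": "category", "a": "category",
    "human": "animate", "animate": "entity", "food": "entity",
}


def subsumes(general, specific):
    """True if 'specific' equals 'general' or inherits from it."""
    current = specific
    while current is not None:
        if current == general:
            return True
        current = PARENT.get(current)   # None once the root is passed
    return False


if __name__ == "__main__":
    # A constraint CATEGORY:p is met by the literal ba, so no separate
    # P_FORM-style feature is needed to name the preposition.
    print(subsumes("p", "ba"))
    # Thesaurus inference human --> animate, but not the reverse.
    print(subsumes("animate", "human"), subsumes("human", "animate"))
```

A verb can thus state its requirement at whatever level is appropriate: CATEGORY p for any preposition, or the literal ba for one specific function word.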

In the following, we give a brief description on how our grammar works. The grammar consists of several phrase structure rules and a lexicon with lexical entries and lexical rules. First, ALE compiles the grammar into a Prolog parser. During this process (at compile time), lexical rules are applied to lexical entries. In the case of transitive patterns, this means that one entry of chi will evolve into 10 entries. Please note that it is this expanded lexicon that is used for parsing (at run time).
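The compile-time expansion can be mimicked in a few lines. The sketch below is an assumption-based Python illustration of the ten informal rules of Section 3, not ALE's compiler and not the reduced rule set used in the actual implementation; RULES, expand and the condition tags "distinct" and "prep" are hypothetical names.

```python
# (rule number, output pattern, applicability condition or None).
RULES = [
    (1, "NP1 V NP2: SVO", None),
    (2, "[NP1, constr1] [NP2, constr2] V: SOV", "distinct"),   # rule 2'
    (3, "NP1 [P NP2] V: SOV", "prep"),                         # rule 3'
    (4, "NP2 ... [NP1, constr1] V: OSV", None),
    (5, "NP2 [bei NP1] V: OSV", None),
    (6, "V NP2: VO", None),
    (7, "[NP1, constr1] V: SV", None),
    (8, "[NP2, constr2] V: OV", "distinct"),
    (9, "NP2 [bei V]: OV", None),
    (10, "V: V", None),
]


def expand(verb, constr1, constr2, preposition):
    """Apply every applicable lexical rule once, before any parsing."""
    derived = []
    for number, pattern, condition in RULES:
        if condition == "distinct" and constr1 == constr2:
            continue   # blocked when subject and object constraints coincide
        if condition == "prep" and preposition is None:
            continue   # blocked when the verb has no valency preposition
        derived.append((verb, number, pattern))
    return derived


if __name__ == "__main__":
    lexicon = expand("chi", "animate", "food", "ba")
    print(len(lexicon), "entries derived for chi")
    for entry in lexicon:
        print(entry)
```

Because the expansion happens once, the parser at run time only looks up ready-made entries; a verb such as qingjiao, whose two constraints coincide, would receive fewer entries under these conditions.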

At the level of implementation, we do not need to presuppose an abstract transitive structure as the input of the lexical rules and from there generate 10 new entries for each transitive verb. What is needed is to take one pattern as the basic pattern for the transitive structure and derive the other patterns from it. In fact, we only need 4 lexical rules to derive the other 4 full patterns from 1 basic full pattern. Elliptical patterns can be handled more elegantly by means other than lexical rules.[2]

The basic pattern constitutes the common condition for lexical rules. Although in theory any one of the 5 full patterns can be seen as the basic pattern, the choice is not arbitrarily made. The pattern we chose is the valency preposition pattern (the BA-type construction) NP1 [P NP2] V: SOV (see Lexical rule 3').[3] This is justified as follows. The valency preposition P (ba, xiang, dao, dui, etc.) is idiosyncratically associated with the individual verb. To derive a more general pattern from a specific pattern is easier than the other way round, for example,  NP1 [P NP2] V: SOV --> NP1 V NP2: SVO is easier than NP1 V NP2: SVO --> NP1 [P NP2] V: SOV. This is because we can then directly code the valency preposition under CATEGORY in the SUBCAT feature and do not have to design a specific feature to store this valency information.

 

  5. Summary

The ultimate aim for natural language analysis is to reach interpretation, i.e. to assign roles to the constituents. An old question is how syntax (form) and semantics (meaning) interact in this interpretation process. More specifically, which is a more important factor in Chinese analysis, the syntactic constraint or the semantic constraint? For the linguistic data we have investigated, it seems that sometimes syntax plays a decisive role and other times semantics has the final say. The essence is how to adequately handle the interface between syntax and semantics.

In our proposal, the syntactic constraint is seen as a more fundamental factor. It serves as the frame of reference for the semantic constraint. The involvement of the semantic constraint seems to be most naturally conditioned by syntactic patterns. In order to ensure their effective interaction, we accommodate syntax and semantics in one model.  The model is designed to be based on syntax and resorts to semantic information only when necessary. In concrete terms, the system will selectively enforce or waive the semantic constraint, depending on syntactic patterns.

It should be noted that there are other factors involved in reaching a correct interpretation. For example, in order to recover the omitted complements in elliptical patterns, information from discourse and pragmatics may be vital. We leave this for future research.

 

References

Carpenter, B. & Penn, G. (1994): ALE, The Attribute Logic Engine, User's Guide, Version 2.0

Gao, Qian (1993): “Chinese BA-Construction: Its Syntax and Semantics”, OSU Working Papers in Linguistics 1993, Kathol A. & Pollard C. (eds.)

Huang, Xiuming (1987): “XTRA: The Design and Implementation of A Fully Automatic Machine Translation System”, Ph.D. dissertation.

Li, Audrey (1990): Chapter 6 “Passive, BA, and topic constructions”, Order & Constituency in Mandarin Chinese. Kluwer Academic Publishers

Li, Wei & McFetridge, Paul (1995): “Handling Chinese NP predicate in HPSG”, Proceedings of PACLING-II, Brisbane, Australia

Pollard, Carl  & Sag, Ivan A. (1994): Head-Driven Phrase Structure Grammar,  Centre for the Study of Language and Information, Stanford University, CA

Pollard, Carl  & Sag, Ivan A. (1987): Information-based Syntax and Semantics. Vol. 1: Fundamentals. Centre for the Study of Language and Information, Stanford University, CA

Wilks, Y.A. (1978): “Making Preferences More Active”,  Artificial Intelligence, Vol. 11

Wilks, Y.A. (1975): “A Preferential Pattern-Seeking Semantics for Natural Language Inference”, Artificial Intelligence, Vol. 6

~~~~~~~~~~~~

* This research is part of my Ph.D. project on a Chinese HPSG-style grammar, supported by the Science Council of British Columbia, Canada under G.R.E.A.T. award (code: 61). I thank my supervisor Dr. Paul McFetridge for his supervision. He introduced me to HPSG theory and provided me with his sample grammars. Without his help, I would not have been able to implement the Chinese grammar in a relatively short time. Thanks also go to Prof. Dong Zhen Dong and Dr. Ping Xue for their comments and encouragement.

 

[1]               The other combinations are:

5d1) *      dianxin chi le wo.              OVS

5d2)         dianxin chi le wo.
The Dim Sum ate me.

Note:        It is OK with the 5d2) reading in the pattern NP V NP: SVO.

5e1) *      chi le wo dianxin.               VSO
5e2)         chi le wo dianxin.

(Somebody) ate my Dim Sum.

Note:        It is OK with the 5e2) reading in the pattern V [NP1 NP2]: VO where NP1 modifies NP2.

5f1) *      chi le dianxin wo.                 VOS
5f2)         chi le dianxin, wo.

Eaten the Dim Sum, I have.

Note:        It is OK in Spoken Chinese, with a short pause before wo, in a  pattern like V NP, NP: VOS.

[2]   The conventional configurational approach is based on the assumption that complements are obligatory and should be saturated. If saturation of complements were not taken as a precondition for a phrase, serious problems might arise in structural overgeneration. On the other hand, optionality of complement(s) is a real life fact. Elliptical patterns are seen in many languages and especially commonplace in Chinese. In order to ensure obligatoriness of complements, the lexical rule approach can be applied to elliptical patterns, as shown in Section 3. This approach maintains configurational constraint in tree building to block structural overgeneration, but the cost is great: each possible elliptical pattern for a head will have to be accommodated by a new lexical entry. With the type mechanism provided by ALE, we have developed a technique to allow for optionality of complement(s) and still maintain proper configurational constraint. We will address this issue in another paper Configurational constraint in Chinese grammar.

[3]    This choice coincides with the base‑generated account of the BA construction in [Li, A. 1990], but that does not mean much. First, our so‑called basic pattern is not their D‑Structure. Second, our choice is based on more practical considerations. Their claim involves more theoretical arguments in the context of generative grammar.

 

 

[Related]

Handling Chinese NP predicate in HPSG (old paper)

Notes for An HPSG-style Chinese Reversible Grammar

Outline of an HPSG-style Chinese reversible grammar

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

PhD Thesis: Chapter I Introduction

PhD Thesis: Chapter II Role of Grammar

PhD Thesis: Chapter III Design of CPSG95

PhD Thesis: Chapter IV Defining the Chinese Word

PhD Thesis: Chapter V Chinese Separable Verbs

PhD Thesis: Chapter VI Morpho-syntactic Interface Involving Derivation

PhD Thesis: Chapter VII Concluding Remarks

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP

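【Startup Notes: Anna Resigns】

Anna is a lovely, ambitious young Russian woman. She grew up playing the piano and dancing ballet, and moved to the United States with her parents before finishing elementary school. Tall and graceful, gentle in temperament, poised and considerate, she strikes people as classical without being stuffy, modern without being flashy, sunny and romantic. As everyone knows, Russian matrons tend to be stout, but Russian girls often have a captivating charm. The ones that old-timers of my generation know by heart and never forget include Tonya, the bourgeois young lady in How the Steel Was Tempered; the ballet queen Ulanova; and the peerless figure-skating artist Ekaterina Gordeeva. Anna is just such a Russian girl, around us every day, bringing a warm and gentle air to an office full of mostly boys. Naturally, everyone likes her.

But Anna has resigned and will be leaving soon, and nobody wants to see her go. I feel bad about it too: no more of her chatter and laughter at lunch, no more inviting her to play ping-pong afterwards. It leaves a sense of loss. I asked her whether she really had to leave: didn't you say you liked this environment? You know this office is already too crowded with boys, and we are trying to change this situation, trying to find some girls with affirmative action, and you are leaving?

She replied: I like this environment because the people I work with here, people like you, are the smartest in the world. But because you are all so smart, my own path forward got blocked, so I had to make the painful decision to leave. I had better go to a consulting company and do the analysis work I am good at. Over the past two years I have watched with my own eyes how my 20 hours of manual work got replaced by your 20 seconds of fully automatic search, with results that are often better, more complete and more consistent than the manual ones.

What she said is true. It was indeed the technology transfer that took away her job. The company did not want to let her go and decided to move her into online customer service, but after thinking it over she felt that, young as she is, she should not give up her specialty, and so she decided to leave.

As the technical lead, I am directly tied to her departure. This is a living example of machines replacing human labor.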

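When I joined the company two years ago, it was basically a professional service type of company. It had developed a system for internal use, but the system's output only narrowed down the scope of the manual work; it took lengthy post-editing (manual additions, deletions and fixes, plus analysis and summarization) before anything could be delivered to a client. We called the editors information analysts. The job required strong language skills, very fast reading comprehension, and the ability to analyze and synthesize. Anna was the best of the information analysts. Clients were especially satisfied with the analysis reports that went through her hands.

But a company has to do its cost accounting. The upshot was that manual labor is acceptable only in moderation; otherwise expenses exceed income and it is a losing business. At that time each search-analysis order took 22 hours of manual labor on average to complete. These 22 hours were called pain time (pain for the analyst, and even more so for the company). To make money, pain time ideally had to be kept under two hours, which sounded like a fantasy back then. When the boss talked to me, he made it the main goal but set no deadline, because no one knew whether it was feasible or how many resources it would take. I did not know either; I only felt the weight of the burden. In my previous work, I had always done the research first, then built a prototype engine, then looked for an application area, and finally developed a product. This company was the exact opposite of most technology-innovation companies: it had the clients first, then a crude engine, and only at the end brought in talent and technology, pinning its hopes on rapid technology transfer. This route felt fresh and exciting to me, and I thought it worth a try to see whether my technology-transfer skills would be in their element. The advantage of having clients and an application area first is obvious: like the guiding light of the Zunyi Conference for the Communist cause, it spares you a long groping in the dark. The road was bright; what remained to be seen was how to walk it and make money.

To make a long story short: after I took over, the core of the system was replaced within three months, results improved markedly within half a year, and by the first anniversary the pain time of manual labor had dropped below two hours. The boss was overjoyed.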

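But the human heart is never content. The boss told me: Wei, you know, your technology has revolutionized our business. Our footing is no longer in question. If we wanted to, we could keep a machine-plus-human service and readily grow into a company with tens of millions in annual revenue. But as long as there is manual labor, we cannot scale up, the money is limited, and the business cannot grow big. I know you are an ambitious man (I said to myself: you are not me, so how would you know?), and surely would not be content with small-time operations. Whatever the risk, we have decided to abandon this path and go fully automatic, so that the system can serve all analyst clients instead of only our in-house staff (people like Anna) or power users who need special training. Our goal is to make every analyst in the world unable to do without us, just as nobody can do without Google. To that end we must bring pain time to zero. It is a risky move, but the prospects are limitless.

Wow, with that tone he was already dreaming of dominating the world. America is an interesting place; this soil produces plenty of indomitable entrepreneurial dreamers whose ambitions reach the sky. But America is no paradise for dreamers: 95% of them fall, fewer than 5% survive, and only about 1% eventually make it big. Truly, one general's glory is built on ten thousand bones. Even so, American-made entrepreneurial dreamers keep coming, wave after wave. I actually like these dreamers a lot; their tenacity and passion are contagious.

Another year went by. In one major analysis domain we achieved the goal of eliminating pain time completely (pain time 0), taking search analysis from 22 hours of manual labor two years ago to 20 seconds today, fully automatic and delivered on the spot, with no manual editing needed.

What is gained on one side is lost on the other: two years of hard battle produced an achievement beyond everyone's expectations, but we also lost a lovely Russian girl.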

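【Notes from My Second Startup】 Written in April 2008

【Postscript】 About Anna, there is one more little episode. As everyone knows, people at startups love to daydream and count their chickens, and stock options are the dream catalyst.

One day, when the guys were counting their chickens as usual, Anna said to me: Wei, come here, I got something to show you. I walked over and saw that it was a car. She said to me, word by word: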

I like this car. I just love it. It is my dream car. I want to buy it.
Guys, work hard so I can own this car.

When I took a closer look at the price tag, I nearly fell over: more than a million. She really dares to dream. Good heavens, here it is:

http://abcnews.go.com/GMA/Moms/story?id=1406161


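【A Parse a Day: If Not Me, Then Who? And Who Am I?】

Last night's famous passage:
【中秋,混得好的是花前月下,混得一般的是月下花钱,混得最差的是花下月的钱,混得最好的是钱下月花。】 (Roughly: at Mid-Autumn, those doing well are "among the flowers, under the moon"; those doing so-so are "spending money under the moon"; those doing worst are "spending next month's money"; those doing best have "money left to spend next month".)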

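[Figure: parse graph 0916a]

[Figure: parse graph 0916b]

The parse is almost perfect, but there is one flaw: a separable word was not matched up. Compare:

[Figure: parse graph 0916d]

Put together it gets dazzling; this is no ordinary graph, and it is quite different from most syntactic trees:

[Figure: parse graph 0916c]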

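Let me show off the parsing from the day before yesterday as well. There is no absolute standard for Chinese deep parsing, but linguists do have a scale in their hearts for what is sound and what is not: insiders see the substance, outsiders watch the spectacle. The feeling is somewhat eerie and thrilling. On the one hand, I feel I am walking a road no one has walked before, filled with a pioneer's solemnity and passion. On the other hand, it feels like something destined, a mission carried out on Heaven's behalf: if not me, then who? And who am I? If language is the carrier and expression of thought (presentation), then parsing is the formalized machine display of thought (representation), and I am the messenger connecting the two. Thank God that, in creating language as a riddle, he did not forget to leave the key behind.

[Figure: parse graph 0915a]

[Figure: parse graph 0915b]

[Figure: parse graph 0915c]

[Figure: parse graph 0915d]

Yes: 【The thing humans find hardest to comprehend is the machine's ability to analyze the structure of human language】. That machines can reach human-level ability in analyzing language structure is no longer in doubt. As for the part of understanding that machines can hardly reach, it can be handled through human-machine collaboration; that scene lies in the not-too-distant future and is already vividly in view. Let us get ready to embrace this new era of human-machine integration.

Master Hong has a poem:
Like Cook Ding carving the ox, but in language, Master Wei hones his skill inside the Parser. The fine blade is kept hidden deep in the mountains, yet in truth it can cut through any tangle.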

【Related】

Chinese Processing

Parsing

【Pinned: Overview of Liwei's NLP Blog Posts】

《朝华午拾》 Table of Contents

 

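【Looking Back at My PhD-Era Scribbles: An Attempt to Bring Common Sense into Grammar】

As mentioned last time, the vast majority of parsers represent a predicate's subcat quite crudely, with little room to stretch: most simply treat subcat as a code and then determine the pattern in the relevant subcat rules. The lexicon-driven grammar HPSG, however, can do this in fine detail and in a principled way: the subcat pattern can be described meticulously right in the lexicon, and the mapping and decomposition between its syntactic-semantic input (the conditions of the pattern) and its output (the logical semantics) can be given a representation that accords with linguistic principles.

The crude approach has its engineering considerations and justifications; the elaborate layering has its logical elegance. You cannot have both the fish and the bear's paw, and in the end we still lean toward the crude approach. Even so, those who take the crude and quick route benefit greatly from having experienced the elegance of structural representation: at the very least they will not be misled by the crude surface, and for complex linguistic phenomena they can gradually get past the limitations of crudeness.

Recently I looked back at the scribbled papers from my PhD days. Although the insight into Chinese syntax reflected in them is not outstanding, thanks to the structural richness of HPSG the application of subcat in Chinese grammar is laid out in an orderly way and stands the test of time. I studied HPSG very intently back then and digested it thoroughly. Precisely because I had digested it thoroughly, there were no lingering attachments when I later moved beyond it.

For example, the phenomenon of Chinese NPs with open slots was modeled like this: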

11a)     桌子坏了。
11b)     腿坏了。
11c)     桌子的腿坏了。
12a)     他好。
12b)     身体好。
12c)     他的身体好。

When people say 11b) tui huai le (leg went wrong), we know something (the possessor) is omitted. For 11a), however, we have no such feel of incompleteness.

Although we may also ask whose table, this possessive relation between who and table is by no means innate. Similarly, ta (he) in 12a) is a complete notion denoting someone while shenti (body) in 12b) is not. In 11c) and 12c), the possessor appears in the possessive structure DE-construction, the expectation of tui (leg) and shenti (body) is realized. These examples show that some words (concepts) have conceptual expectation for some other words (concepts) although the expected words do not necessarily show up in a sentence and the expectation might not be satisfied. In fact, this type of expectation forms part of our knowledge (common sense). One way to represent the knowledge is to encode it with the related word in the lexicon.
Therefore we propose an underlying SYNSEM feature KNOWLEDGE to store some of our common sense knowledge by capturing the internal relation between concepts. KNOWLEDGE parallels to syntactic SUBCAT and semantic RELATION. KNOWLEDGE imposes semantic constraints on their expected arguments no matter what syntactic forms the arguments will take (they may take null form, i.e. the underlying arguments are not realized). In contrast, SUBCAT only defines syntactic requirement for the complements and gets interpreted in RELATION. Following this design, syntactic form and semantic constraints are kept apart. When necessary, the interaction between them can be implemented by lexical rules, or directly coindexed in the lexicon. For example, the following KNOWLEDGE information will be enforced as the necessary semantic constraints when we handle Chinese NP predicates by a lexical rule (see 3.3).

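Smuggling common sense into the grammar through the back door began at that time. This practice is uncommon in formal grammars of European languages, because syntactic form is largely sufficient there and usually needs no help from common sense. For Chinese, however, it is very hard to build a mature deep-analysis system without bringing in some kind of common sense. The syntactic structure model with common sense was defined like this back then: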

PHON      shenti
SYNSEM | KNOWLEDGE | PRED [1] possess
SYNSEM | KNOWLEDGE | POSSESSOR [2] human
SYNSEM | KNOWLEDGE | POSSESSED [3]
SYNSEM | LOCAL | CONTENT | INDEX [3]
SYNSEM | LOCAL | CONTENT | RESTRICTION { RELATION body }
SYNSEM | LOCAL | CONTENT | RESTRICTION { INSTANCE [3] }

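Finally, the introduction of common sense into Chinese grammar was regarded as a natural extension of the agreement in gender, number and case used in European languages: an extension from syntactic means to semantic constraints.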

Agreement revisited
This section relates semantic constraints which embody common sense to the conventional linguistic notion of agreement. We will show that they are essentially the same thing from different perspectives. We only need slight expansion for the definition of agreement to accommodate some of our basic knowledge. This is important as it accounts for the feasibility of coding knowledge in linguistic ways. Linguistic lexicon seems to be good enough to house some general knowledge in addition to linguistic knowledge.

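To parse “我鸡吃” (I chicken eat) and “鸡我吃” (chicken I eat), common sense entered the grammar (nowadays big data can also be used to bring common sense in):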

A typical example of how concepts are linked in a network (a sophisticated concept lexicon) is seen in the representation of drink ((*ANI SUBJ) (((FLOW STUFF) OBJE) ((SELF IN) (((*ANI (THRU PART)) TO) (BE CAUSE))))) in Wilks 1975b. While for  various reasons we will not go as far as Wilks, we can gain enlightenment from this type of AI approach to knowledge. Lexicon-driven systems like the one in HPSG can, of course, make use of this possibility. Take the Chinese role-assignment problem, for example, the common sense that ANIMATE being eats FOOD can be seamlessly incorporated in the lexical entry chi (eat) as a semantic agreement requirement.

PHON chi
SYNSEM | KNOWLEDGE | PRED [1]  eat
SYNSEM | KNOWLEDGE | AGENT [2] animate
SYNSEM | KNOWLEDGE | PATIENT [3] food
SYNSEM | LOCAL | CATEGORY | SUBCAT | EXTERNAL_ARGUMENT [NP: [4]]
SYNSEM | LOCAL | CATEGORY | SUBCAT | INTERNAL_ARGUMENTS <[NP: [5]]>
SYNSEM | LOCAL | CONTENT | RELATION [1]
SYNSEM | LOCAL | CONTENT | EATER [4] | INDEX | ROGET [2]
SYNSEM | LOCAL | CONTENT | EATEN [5] | INDEX | ROGET [3]

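As you can see, what looks like nothing more than a subcat code after POS subdivision actually contains a great deal of structure and the knowledge embedded in it. Today, when unification grammars have almost become a relic of history, I still consider a representation like HPSG's to be one of the most elegant logical representations in linguistics; for logical clarity and beauty, later grammars can hardly surpass it.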

 


 

Handling Chinese NP predicate in HPSG (old paper)

Handling Chinese NP predicate in HPSG
(old paper in Proceedings of the Second Conference of the Pacific
Association for Computational Linguistics, Brisbane, 1995)

Wei Li & Paul McFetridge

Department of Linguistics
Simon Fraser University
Burnaby, B.C. CANADA  V5A 1S6

 

Key words: HPSG; knowledge representation; Chinese processing

 

Abstract 

This paper addresses a type of Chinese NP predicate in the framework of HPSG 1994 (Pollard & Sag 1994). Special emphasis is laid on knowledge representation and the interaction of syntax and semantics in natural language processing. A knowledge-based HPSG model is designed. This design not only lays a foundation for effectively handling the Chinese NP predicate problem, but also has theoretical and methodological significance for NLP in general.

In Section 1, the data are analyzed, and both structural and semantic constraints for this pattern are defined. Section 2 discusses the semantic constraints in the wider context of the conceived knowledge-based model. The aim of natural language analysis is to reach interpretations, i.e. to correctly assign semantic roles to the constituents. We show that some structures cannot be interpreted without resorting to some common sense knowledge, and we present a way to organize and utilize knowledge in the HPSG lexicon. In Section 3, a lexical rule for this pattern is proposed in our HPSG model for Chinese, whose prototype is being implemented.

  1. Problem

We will show the data of Chinese NP predicate first. Then we will investigate what makes it possible for an NP to behave like a predicate. We will do this by defining both the syntactic and semantic constraints for this Chinese pattern.

1.1. Data: one type of Chinese NP predicate

1) 他好身体。

ta         hao      shenti.
he        good    body
He is of good health.

2)  张三高个子。

Zhangsan         gao      gezi.
Zhangsan         tall       figure
Zhangsan is tall.

3)  李四圆圆的脸。

Lisi      yuanyuan         de        lian.
Lisi      round-round    DE       face.
Lisi has a quite round face.

4) 这件大衣红颜色。

zhe       jian      dayi     hong    yanse.
this      (cl.)      coat     red       colour.
This coat is of red colour.

5)  明天小雨。

mingtian          xiao     yu.
tomorrow        little     rain.
Tomorrow it will drizzle.

6)  那张桌子三条腿。

na        zhang   zhuozi san       tiao      tui.
that      (cl.)      table   three    (cl.)      leg
That table is three-legged.

Note:      (cl.) for classifier.
DE for Chinese attribute particle.

The relation between the subject NP and the predicate NP is not identity. The NP predicate in Chinese usually describes a property the subject NP has, corresponding to English be-of/have NP. In identity constructions, the linking verb SHI (be) cannot normally be omitted.[1]

7a)  他是学者。

ta         shi        xuezhe.
he        be        scholar
He is a scholar.

7b) ?他学者。

ta         xuezhe.
he        scholar

1.2.  Problem analysis

1.2.1. We first investigate the structural characteristics of the Chinese NP predicate pattern.

A single noun cannot act as predicate. More restrictively, not every NP can become a predicate. It seems that only the NP with the following configuration has this potential: NP [lex -, predicate +].  In other words, a predicate NP consists of a lexical N with a modifying sister. Structures of this sort should not be further modified.[2] Thus, the following patterns are predicted.

8a)      那张桌子三条腿。

na        zhang   zhuozi san       tiao      tui.                   [ same as 6) ]
that      (cl.)      table    three    (cl.)      leg
That table is three-legged.

8b)       那张桌子塑料腿。

na        zhang   zhuozi suliao   tui.
that      (cl.)      table    plastic leg
That table is of plastic legs.

8c) * 那张桌子三条塑料腿。
*    na        zhang   zhuozi san       tiao      suliao   tui.       [too many attributes]

8d) * 那张桌子腿。
*    na        zhang   zhuozi tui.                                           [no attributes]

1.2.2. What is the semantic constraint for the Chinese NP predicate pattern?

Although there is no syntactic agreement between subject and predicate in Chinese, there is an obvious semantic "agreement" between the two: hao shenti (good body) requires a HUMAN as its subject; san tiao tui (three leg) demands that the subject be FURNITURE or ANIMATE. Therefore, the following are unacceptable:

9) * 这杯茶好身体。

* zhe       bei       cha       hao      shenti.
this      cup      tea       good    body

10) * 空气三条腿。

* kongqi san       tiao      tui.
air        three    (cl.)      leg

Obviously, it is not hao (good) or san tiao (three) which imposes this semantic selection on the subject. The semantic restriction comes from the noun shenti (body) or tui (leg). There is an internal POSSESS relationship between them: shenti (body) belongs to human beings and tui (leg) is one part of an animal or some furniture. This common sense relation is a crucial condition for the successful interpretation of the Chinese NP predicate sentences.

There are a number of issues involved here. First, what is the relationship of this type of knowledge to the syntactic structures and semantic interpretations? Second, where and how would this knowledge be represented? Third, how will the system use the knowledge when it is needed? More specifically, how will the introduction of this knowledge coordinate with the other parts of the well established HPSG formalism? Those are the questions we attempt to answer before we proceed to provide a solution to the Chinese NP predicate. Let us look at some more examples:

11a)     桌子坏了。

zhuozi huai     le.
table    bad      LE
The table went wrong.

11b)     腿坏了。

tui        huai     le.
leg       bad      LE
The leg went wrong.

11c)     桌子的腿坏了。

zhuozi  de        tui        huai     le.
table    DE       leg       bad      LE
The table's leg went wrong.

12a)     他好。

ta         hao.
he        good
He is good.

12b)     身体好。

shenti   hao.
body    good
The health is good.

12c)     他的身体好。

ta         de        shenti   hao.
he        DE       body    good
His health is good.

Note: LE for Chinese perfect aspect particle.

When people say 11b) tui huai le (leg went wrong), we know something (the possessor) is omitted. For 11a), however, we have no such feeling of incompleteness. Although we may also ask whose table, this possessive relation between who and table is by no means innate. Similarly, ta (he) in 12a) is a complete notion denoting someone while shenti (body) in 12b) is not. In 11c) and 12c), the possessor appears in the possessive DE-construction, and the expectation of tui (leg) and shenti (body) is realized. These examples show that some words (concepts) have a conceptual expectation for some other words (concepts), although the expected words do not necessarily show up in a sentence and the expectation might not be satisfied. In fact, this type of expectation forms part of our knowledge (common sense). One way to represent the knowledge is to encode it with the related word in the lexicon.

Therefore we propose an underlying SYNSEM feature KNOWLEDGE to store some of our common sense knowledge by capturing the internal relation between concepts. KNOWLEDGE parallels syntactic SUBCAT and semantic RELATION. KNOWLEDGE imposes semantic constraints on the expected arguments no matter what syntactic forms the arguments will take (they may take null form, i.e. the underlying arguments are not realized). In contrast, SUBCAT only defines the syntactic requirements for the complements and gets interpreted in RELATION. Following this design, syntactic form and semantic constraints are kept apart. When necessary, the interaction between them can be implemented by lexical rules, or directly coindexed in the lexicon. For example, the following KNOWLEDGE information will be enforced as the necessary semantic constraints when we handle Chinese NP predicates by a lexical rule (see 3.3).

PHON      shenti
SYNSEM | KNOWLEDGE | PRED [1] possess
SYNSEM | KNOWLEDGE | POSSESSOR [2] human
SYNSEM | KNOWLEDGE | POSSESSED [3]
SYNSEM | LOCAL | CONTENT | INDEX [3]
SYNSEM | LOCAL | CONTENT | RESTRICTION { RELATION body }
SYNSEM | LOCAL | CONTENT | RESTRICTION { INSTANCE [3] }
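
As a rough illustration of the idea (not the ALE implementation), the KNOWLEDGE feature can be pictured as an extra slot on a lexical entry that records which semantic classes a word expects for an underlying role. The Python class names, the `LEXICON` table and the `feels_incomplete` helper below are assumptions made for this sketch; only the words, the POSSESS relation and the expected classes come from the discussion above.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Knowledge:
    """Underlying conceptual expectation of a word (the KNOWLEDGE feature)."""
    pred: str                                            # e.g. "possess"
    roles: Dict[str, Set[str]] = field(default_factory=dict)  # role -> accepted classes


@dataclass
class LexicalEntry:
    phon: str
    relation: str                                        # CONTENT | RESTRICTION | RELATION
    knowledge: Optional[Knowledge] = None

    def expected(self, role: str) -> Set[str]:
        """Semantic classes expected for `role`; empty if there is no expectation."""
        if self.knowledge is None:
            return set()
        return self.knowledge.roles.get(role, set())


# Hypothetical mini-lexicon; the class labels follow the prose (HUMAN, ANIMATE, FURNITURE).
LEXICON = {
    "shenti": LexicalEntry("shenti", "body", Knowledge("possess", {"POSSESSOR": {"human"}})),
    "tui": LexicalEntry("tui", "leg", Knowledge("possess", {"POSSESSOR": {"animate", "furniture"}})),
    "zhuozi": LexicalEntry("zhuozi", "table"),
    "ta": LexicalEntry("ta", "he"),
}


def feels_incomplete(word: str) -> bool:
    """A bare noun feels incomplete when it carries a POSSESSOR expectation."""
    return bool(LEXICON[word].expected("POSSESSOR"))
```

Under these assumptions, `feels_incomplete("tui")` holds while `feels_incomplete("zhuozi")` does not, mirroring the contrast between 11b) and 11a).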

  2. Agreement revisited

This section relates semantic constraints which embody common sense to the conventional linguistic notion of agreement. We will show that they are essentially the same thing from different perspectives. We only need slight expansion for the definition of agreement to accommodate some of our basic knowledge. This is important as it accounts for the feasibility of coding knowledge in linguistic ways. Linguistic lexicon seems to be good enough to house some general knowledge in addition to linguistic knowledge. Some possible problems with this knowledge-based approach are also discussed.

Let's first consider the following two parallel agreement problems in English:

13) *    The boy drink.

14) ?    The air drinks.

13) is ungrammatical because it violates the syntactic agreement between the subject and predicate. 14) is conventionally considered grammatical although it violates the semantic agreement between the agent and the action. Since the approach taken in this paper is motivated by semantic agreement, some elaboration and comment on agreement is needed.

Agreement in person, gender and number is encoded in the CONTENT | INDEX features (Pollard & Sag 1994, Chapter 2). It follows that any two co-indexed signs naturally agree with each other. That is desirable because co-indexed signs refer to the same entity. However, person, gender and number seem to be only part of the story of agreement. We may expand the INDEX feature to cope with semantic agreement for handling Chinese, and for in-depth semantic analysis of other languages as well.

Note that to accommodate semantic agreement in HPSG, we first need features to represent the result of semantic classification of lexical meanings like HUMAN, FOOD, FURNITURE, etc. We therefore propose a ROGET feature (named after the thesaurus dictionary) and put it into the INDEX feature.

Semantic agreement, sometimes termed semantic constraint or semantic selection restriction in the literature, is not a new concept in natural language processing. Hardly any in-depth language analysis can go smoothly without incorporating it to a certain extent. For languages like Chinese with virtually no inflection, it is even more important. We can hardly imagine how the roles can be correctly assigned without the involvement of semantic agreement in the following sentences of the form NP1 NP2 Vt:

15a)     点心我吃了。

dianxin            wo       chi       le.
Dim-Sum         I           eat       LE
The Dim Sum I have eaten.

15b)     我点心吃了。

wo       dianxin            chi       le.
I           Dim-Sum         eat       LE
I have eaten the Dim Sum.

Who eats what?  There is no formal way to correctly assign the roles other than resorting to the semantic agreement enforced by eat. In HPSG 1994, it was pointed out (Pollard & Sag 1994, p. 81), "... there is ample independent evidence that verbs specify information about the indices of their subject NPs. Unless verbs 'had their hands on' (so to speak) their subjects' indices, they would be unable to assign semantic roles to their subjects." The Chinese data show that sometimes verbs need to have their hands on the semantic categories (ROGET) of both their external argument (subject) and internal arguments to be able to correctly assign roles. Now that we have expanded the INDEX feature to cover both ROGET and the conventional agreement features number, person and gender, the above claim of Pollard and Sag becomes more general.

It is widely agreed that knowledge is bound to play an important role in natural language analysis and disambiguation. The question is how to build a knowledge-based system which is manageable. Knowledge consists of linguistic knowledge (phonology, morphology, syntax, semantics, etc.) and extra-linguistic knowledge (common sense, professional knowledge, etc.). Since semantics is based on lexical meanings, lexical meanings represent concepts and concepts are linked to each other in a way to form knowledge, we can well regard semantics as a link between linguistics and beyond-linguistics in terms of knowledge. In other words, some extra-linguistic knowledge may be represented in linguistic ways. In fact, lexicon, if properly designed, can be a rich source of knowledge, both linguistic and extra-linguistic. A typical example of how concepts are linked in a network (a sophisticated concept lexicon) is seen in the representation of drink ((*ANI SUBJ) (((FLOW STUFF) OBJE) ((SELF IN) (((*ANI (THRU PART)) TO) (BE CAUSE))))) in Wilks 1975b. While for  various reasons we will not go as far as Wilks, we can gain enlightenment from this type of AI approach to knowledge. Lexicon-driven systems like the one in HPSG can, of course, make use of this possibility. Take the Chinese role-assignment problem, for example, the common sense that ANIMATE being eats FOOD can be seamlessly incorporated in the lexical entry chi (eat) as a semantic agreement requirement.

PHON chi
SYNSEM | KNOWLEDGE | PRED [1]  eat
SYNSEM | KNOWLEDGE | AGENT [2] animate
SYNSEM | KNOWLEDGE | PATIENT [3] food
SYNSEM | LOCAL | CATEGORY | SUBCAT | EXTERNAL_ARGUMENT [NP: [4]]
SYNSEM | LOCAL | CATEGORY | SUBCAT | INTERNAL_ARGUMENTS <[NP: [5]]>
SYNSEM | LOCAL | CONTENT | RELATION [1]
SYNSEM | LOCAL | CONTENT | EATER [4] | INDEX | ROGET [2]
SYNSEM | LOCAL | CONTENT | EATEN [5] | INDEX | ROGET [3]

Note:        Following the convention, the part after the colon is SYNSEM | LOCAL | CONTENT information.
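
To make the role-assignment point concrete, here is a minimal Python sketch (not the paper's ALE/Prolog code) of how the AGENT and PATIENT expectations in the entry for chi could decide who eats what in 15a) and 15b). The `ROGET` and `ISA` tables, the `is_a` helper and the `assign_roles` function are hypothetical names introduced only for this illustration.

```python
# Assumed semantic classes for the two nouns in 15a) and 15b).
ROGET = {"wo": "human", "dianxin": "food"}

# A tiny is-a chain so that "human" satisfies an "animate" requirement (assumed).
ISA = {"human": "animate", "animate": None, "food": None}

# The KNOWLEDGE part of the entry chi (eat), as in the feature listing above.
CHI = {"PRED": "eat", "AGENT": "animate", "PATIENT": "food"}


def is_a(cls, required):
    """True if `cls` equals `required` or inherits from it via ISA."""
    while cls is not None:
        if cls == required:
            return True
        cls = ISA.get(cls)
    return False


def assign_roles(np1, np2, verb=CHI):
    """Try both assignments for NP1 NP2 Vt and keep those with semantic agreement."""
    readings = []
    for eater, eaten in ((np1, np2), (np2, np1)):
        if is_a(ROGET[eater], verb["AGENT"]) and is_a(ROGET[eaten], verb["PATIENT"]):
            readings.append({"RELATION": verb["PRED"], "EATER": eater, "EATEN": eaten})
    return readings
```

With these toy tables, both `assign_roles("dianxin", "wo")` and `assign_roles("wo", "dianxin")` yield the single reading in which wo is the EATER and dianxin the EATEN, regardless of surface order.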

One last point we would like to make in this context is that semantic agreement, like syntactic agreement, should be able to loosen its restriction. In other words, agreement is just a canonical requirement, or in Wilks's term a preference (Wilks 1975a). In actual communication, deviation in different degrees is often seen, and people often relax the preference restriction in order to understand. With semantic agreement, deliberate deviation is one of the handy means of rhetorical expression. In a certain domain, Chomsky's famous sentence Colorless green ideas sleep furiously is well imaginable. On the other hand, deviation from syntactic agreement will not affect the meaning if no confusion is caused, which may or may not happen depending on context and the structure of the language. In English, lack of syntactic agreement for the present third person singular between subject and predicate usually causes no problem. Sentence 13) The boy drink can therefore be accepted and correctly interpreted. There is much more to say on the interaction of the two types of agreement deviation, how a preference model might be conceived, what computational complexities it may cause and how to handle them effectively. We plan to address these in another paper. The interested reader is referred to one famous approach in this direction (Wilks 1975a, 1978).
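
One simple way to picture agreement as a preference rather than a hard filter is to count violations and keep the least-violating reading instead of rejecting a sentence outright. The sketch below is only a toy under that assumption; it is not Wilks's preference semantics, and the tables and function names are invented for the illustration.

```python
# Assumed toy classes and is-a chain; none of these tables come from the paper.
ISA = {"human": "animate", "animate": None, "food": None, "gas": None}
ROGET = {"wo": "human", "dianxin": "food", "air": "gas"}
PREFERENCES = {
    "eat": {"AGENT": "animate", "PATIENT": "food"},
    "drink": {"AGENT": "animate"},
}


def is_a(cls, required):
    """True if `cls` equals `required` or inherits from it via ISA."""
    while cls is not None:
        if cls == required:
            return True
        cls = ISA.get(cls)
    return False


def violations(pred, reading):
    """Number of role fillers that miss the predicate's preferred class."""
    prefs = PREFERENCES[pred]
    return sum(
        1
        for role, filler in reading.items()
        if role in prefs and not is_a(ROGET[filler], prefs[role])
    )


def best_readings(pred, candidates):
    """Keep the candidate readings with the fewest preference violations."""
    scored = [(violations(pred, reading), reading) for reading in candidates]
    fewest = min(score for score, _ in scored)
    return [(score, reading) for score, reading in scored if score == fewest]
```

In this sketch, 14) The air drinks is not thrown away: its only reading survives with one violation, while for dianxin wo chi le the reading with zero violations wins over the one with two.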

 

  3. Solution

We will set some requirements first and then present a lexical rule to see how well it meets our requirements.

3.1. Based on the discussion in Section 1, the solution to the Chinese predicate NP problem should meet the following 4 requirements:

(1)        It should enforce the syntactic constraints for this pattern: one and only one modifier XP in the form of NP1 XP N2.

(2)        It should enforce the semantic constraints for this pattern: N2 must expect NP1 as its POSSESSOR with semantic agreement.

(3)        It should correctly assign roles to the constituents of the pattern: NP1 POSSESS NP2 (where NP2 consists of XP N2).

(4)        It should be implementable in HPSG formalism.

 

3.2. What mechanisms can we use to tackle a problem in HPSG formalism?

An HPSG grammar consists of two components: a general grammar (ID schemata and principles) and a lexical grammar (in the lexicon). The lexicon houses lexical entries with their linguistic description and knowledge representation in feature structures. The lexicon also contains generalizations captured by inheritance in the lexical hierarchy and by a set of lexical rules. Roughly speaking, lexical hierarchy covers static redundancy between related potential structures. Precisely because the lexicon can reflect different degrees of lexical redundancy in addition to idiosyncrasy, the general grammar can be kept to a minimum.

The Chinese NP predicate pattern should be treated in the lexicon. There are two arguments for that. First, this pattern covers only restricted phenomena (see 3.4). Second, it relies heavily on the semantic agreement, which in our model is specified in the lexicon by KNOWLEDGE. We need somehow to link the semantic expectation KNOWLEDGE and the syntactic expectation SUBCAT or MOD. The general mechanism to achieve that is structure sharing by coindexing the features either directly in the lexical entries (see the representation of the entry chi in Section 2) or through lexical rules (see 3.3).

3.3. Lexical Rule

Lexical rules are applied to lexical signs (words, not phrases) which satisfy the condition. The result of the application is an expanded lexicon to be used during parsing. Since the pattern is of the form NP1 XP N2, the only possible target is N2, i.e. shenti (body) or tui (leg). This is due to the fact that among the three necessary signs in this form, the first two are phrases and only the final N2 is a lexical sign. We assume the following structure for our proposed lexical rule:

NP[ta[1]]         [[AP[2] hao] [N<NP[1], XP[2]> shenti]]

NP Predicate Lexical Rule

hpsg1

SYNSEM | KNOWLEDGE | PRED [1] possess
SYNSEM | KNOWLEDGE | POSSESSOR [2]
SYNSEM | LOCAL | CATEGORY | HEAD | MAJ [6] n
SYNSEM | LOCAL | CATEGORY | PREDICATE -
SYNSEM | LOCAL | CONTENT | INDEX [4]
SYNSEM | LOCAL | CONTENT | RESTRICTION {[3]}

==>

...| CATEGORY | PREDICATE +
...| CATEGORY | SUBCAT | EXTERNAL_ARGUMENT [NP: [5]]
...| CATEGORY | SUBCAT | INTERNAL_ARGUMENTS < [...| CATEGORY | HEAD | MOD [6] ] >
...| CATEGORY | SUBCAT | INTERNAL_ARGUMENTS < [...| CONTENT | INDEX [4] ] >
...| CATEGORY | SUBCAT | INTERNAL_ARGUMENTS < [...| CONTENT | RESTRICTION {[7]} ] >
...| CATEGORY | SUBCAT | INTERNAL_ARGUMENTS < [...| LEX - ] >
...| CONTENT | RELATION [1] possess
...| CONTENT | POSSESSOR [5] | INDEX | ROGET [2]
...| CONTENT | POSSESSED | INDEX [4]
...| CONTENT | POSSESSED | RESTRICTION {[7] | [3] }

For a complicated information flow like this, it is best to explain the indices one by one with regard to the example ta hao shenti (he is of good body) in the form of NP1 XP N2.

The index [1] links the underlying PRED feature of N2 to the semantic RELATION feature; in other words, the predicate in the underlying KNOWLEDGE of shenti (body) now surfaces as the relation for the whole sentence. The index [2] enforces the semantic constraint for this pattern, i.e. shenti (body) expects a human (ROGET) possessor as the subject (EXTERNAL_ARGUMENT) for this sentence. The index [3] is the restriction relation of N2. [4] links the INDEX features of XP and N2, and [6] indicates that the internal argument is a de-facto modifier of N2, i.e. XP mods-for N2. Note that the part of speech of the internal argument (INTERNAL_ARGUMENT | SYNSEM | LOCAL | CATEGORY | HEAD | MAJ) is deliberately not specified in the rule because Chinese modifiers (XP) are not confined to one class, as can be seen in our linguistic data. Finally, [7] defines the restriction relation of the XP to the INDEX of N2.

The indices [4], [7] and [3] all contribute to artificially creating a semantic interpretation for [XP N2]. As is interpreted, XP is, in fact, a modifier of N2 and they would form an NP2, or [XP N2] constituent. In normal circumstances, the building of NP2 interpretation is taken care of by HPSG Semantics Principle. But in this special pattern, we have treated XP as a complement of N2, yet semantically they are still understood as one instance: hao shenti (good body) is an instance of good and body. This interpretation of NP2 serves as POSSESSED of the sentence predicate, indicated by the structure-sharing of [4], [7] and [3]. Finally, [5] is the interpretation of NP1 and is assigned the role of POSSESSOR for the sentence predicate.
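
The information flow described above can be approximated in a few lines of Python. This is a simplified sketch, not the ALE rule: feature structures are plain dictionaries, structure sharing is imitated by copying values, semantic agreement is an exact class match, and the entries `SHENTI`, `TA`, `CHA` and `HAO` (including the class label "beverage") are assumptions for the example.

```python
import copy


def np_predicate_rule(noun):
    """Derive a predicative entry from a noun whose KNOWLEDGE | PRED is possess.

    Returns None when the input condition of the rule is not met.
    """
    know = noun.get("KNOWLEDGE")
    if noun.get("MAJ") != "n" or noun.get("PREDICATE") or not know:
        return None
    if know.get("PRED") != "possess":
        return None
    derived = copy.deepcopy(noun)
    derived["PREDICATE"] = True
    derived["SUBCAT"] = {
        # Index [2]: the subject must match the expected POSSESSOR class.
        "EXTERNAL_ARGUMENT": {"ROGET": know["POSSESSOR"]},
        # Index [6] and LEX -: exactly one phrasal modifier of the noun.
        "INTERNAL_ARGUMENTS": [{"MOD": noun["MAJ"], "LEX": False}],
    }
    return derived


def interpret(np1, xp, predicate):
    """Build the POSSESS reading for NP1 XP N2, or None if a constraint fails."""
    subcat = predicate["SUBCAT"]
    wanted = subcat["INTERNAL_ARGUMENTS"][0]
    if np1["ROGET"] != subcat["EXTERNAL_ARGUMENT"]["ROGET"]:
        return None  # semantic agreement (index [2]) is violated
    if xp["MOD"] != wanted["MOD"] or xp["LEX"] != wanted["LEX"]:
        return None  # XP is not a phrasal modifier of the noun
    return {
        "RELATION": predicate["KNOWLEDGE"]["PRED"],  # index [1]
        "POSSESSOR": np1["RELATION"],                # index [5]
        "POSSESSED": {"RESTRICTION": [xp["RELATION"], predicate["RELATION"]]},  # [7], [3]
    }


SHENTI = {"PHON": "shenti", "MAJ": "n", "PREDICATE": False, "RELATION": "body",
          "KNOWLEDGE": {"PRED": "possess", "POSSESSOR": "human"}}
TA = {"PHON": "ta", "ROGET": "human", "RELATION": "he"}
CHA = {"PHON": "zhe bei cha", "ROGET": "beverage", "RELATION": "tea"}
HAO = {"PHON": "hao", "MOD": "n", "LEX": False, "RELATION": "good"}
```

Applying `np_predicate_rule(SHENTI)` and then `interpret(TA, HAO, ...)` gives a possess relation between he and an instance of good body, whereas the same call with `CHA` as subject fails, as in 9).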

Let's see how well this lexical rule meets the 4 requirements set in 3.1.

(1) It enforces the syntactic constraints by treating XP as the internal argument and NP1 as the external argument.

(2) It enforces the semantic constraints through structure sharing by the index [2].

(3) It correctly assigns roles to the constituents of the pattern.

The following interpretation will be established for ta hao shenti (he is of good body) by the parser.

hpsg2

CONTENT | RELATION possess
CONTENT | POSSESSOR | INDEX | PERSON 3
CONTENT | POSSESSOR | INDEX | NUMBER singular
CONTENT | POSSESSOR | INDEX | GENDER male
CONTENT | POSSESSOR | INDEX | ROGET human
CONTENT | POSSESSOR | RESTRICTION { }
CONTENT | POSSESSED | INDEX [1]    | PERSON 3
CONTENT | POSSESSED | INDEX          | NUMBER singular
CONTENT | POSSESSED | INDEX          | GENDER nil
CONTENT | POSSESSED | INDEX          | ROGET organ
CONTENT | POSSESSED | RESTRICTION { [ RELATION good],              [ RELATION body  ] }
CONTENT | POSSESSED | RESTRICTION { [ INSTANCE [1] ],              [ INSTANCE [1]  ] }

In prose, it says roughly that a third person male human (he) possesses something which is an instance of good body. We believe that this is an adequate interpretation of the original sentence.

(4) Last, this rule has been implemented in our Chinese HPSG-style grammar using ALE and Prolog.  The results meet our objective.

But there is one issue we have not touched on yet: word order. At first sight, Chinese seems to have LP constraints similar to those in English. For example, the internal argument(s) of a Chinese transitive verb by default appear on the right side of the head. It seems that our formulation contradicts this constraint in the grammar. But in fact, there are many other examples with the internal argument(s), especially PP argument(s), appearing on the left side of the head.

服务 fuwu (serve): <NP, PP(wei)>

16a) 为人民服务

wei      renmin fuwu
for       people  serve
Serve the people.

16b) ? 服务为人民。

fuwu    wei      renmin.
serve    for       people

有益 youyi (of benefit): <NP, PP(dui / yu)>

17a) 这对我有益。

zhe       dui       wo       youyi
this      to         I           have-benefit
This is of benefit to me.

17b) * 这有益对我。

zhe       youyi               dui       wo
this      have-benefit    to         I

18a) 这于我有益。

zhe       yu        wo       youyi
this      to         I           have-benefit
This is of benefit to me.

18b) 这有益于我。

zhe       youyi               yu        wo
this      have-benefit    to         I
This is of benefit to me.

Word order and its place in grammar are important issues in formulating Chinese grammar. To play it safe and avoid premature generalization, we assume a lexicalized view of Chinese LP constraints, encoding word order information in the lexicon through the SUBCAT and MOD features. This proves to be a realistic and precise approach to Chinese word order phenomena.
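
A lexicalized view of word order can be sketched as a per-head table that states on which side each PP complement may appear. The table layout and the `order_ok` function are hypothetical; the judgments simply restate examples 16) to 18), with the questionable 16b) treated as not licensed for the sake of the sketch.

```python
# For each head, map the preposition of its PP complement to the sides it may occupy.
WORD_ORDER = {
    "fuwu": {"wei": {"left"}},                            # 16a) OK; 16b) "?" treated as out
    "youyi": {"dui": {"left"}, "yu": {"left", "right"}},  # 17a)/17b) and 18a)/18b)
}


def order_ok(head, preposition, side):
    """True if the lexicon licenses a PP headed by `preposition` on `side` of `head`."""
    return side in WORD_ORDER.get(head, {}).get(preposition, set())
```

For instance, `order_ok("youyi", "yu", "right")` holds (18b) while `order_ok("youyi", "dui", "right")` does not (17b), so the difference lives entirely in the lexical entries rather than in a general LP rule.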

3.4. As a final note, we will briefly compare the NP Predicate Pattern with one of the Chinese Topic Constructions:

NP1 NP2 Vi/A
(topic + (subject + predicate))

In Chinese, this is a closely related but much more productive form than the NP Predicate Pattern, and the two structures are different.

19)       他身体好。

ta         shenti   hao
he        body    good
He is good in health.

For topic constructions, we propose a new feature CONTEXT | TOPIC, whose index in this case is token-identical to the INDEX value of ta. Note that in the above structure, the CONTEXT | TOPIC ta is treated as a sentential adjunct rather than a complement subcategorized for by shenti. Why? First, ta is highly optional: a topic-less sentence is still a sentence. Second, and more convincingly, ta cannot always be predicted by the noun that follows it. Compare:

20a) 他身体好。

ta         shenti   hao
he        body    good
He is good in health.

20b) 他好身体。

ta         hao      shenti
he        good    body
He is of good health.

21a) 他脾气好。

ta         piqi                  hao
he        disposition       good
He is good in disposition.

21b)  他好脾气。

ta         hao      piqi
he        good    disposition
He is of good disposition.

but:

22a) 他学习好。

ta         xuexi   hao. [3]
he        study   good
He is good in study.

22b) *  他好学习。

ta         hao      xuexi
he        good    study

What this shows is that for topic sentences like ta shenti hao (He is good in health), ta xuexi hao (He is good in study), etc., there is no requirement to regard the topic ta (he) as a necessary semantic possessor of shenti / xuexi. The relation is rather "in-aspect": something (NP1) is good (A) in some aspect (NP2), or, for something (NP1), some aspect (NP2) is good (A).

Finally, it should be mentioned that our proposed lexical rule requires modification to accommodate sentence 6). That is beyond the scope of this paper, because it is tied to the way we handle Chinese classifiers in the HPSG framework.

 

References

Pollard, Carl  & Sag, Ivan A. (1994): Head-Driven Phrase Structure Grammar,  Centre for the Study of Language and Information, Stanford University, CA

Pollard, Carl & Sag, Ivan A. (1987): Information‑based Syntax and Semantics Vol. 1: Fundamentals. Centre for the Study of Language and Information, Stanford University, CA

Wilks, Y.A. (1975a): A Preferential Pattern-Seeking Semantics for Natural Language Inference.  Artificial Intelligence, Vol. 6, pp.53-74.

Wilks, Y.A. (1975b): An Intelligent Analyzer and Understander of English, in Communications of the ACM, Vol. 18, No.5, pp.264-274

Wilks, Y.A. (1978): Making Preferences More Active.  Artificial Intelligence, Vol. 11, pp. 197-223

~~~~~~~~~~~~~~~ footnotes ~~~~~~~~~~~~~~~~

[1] This is not absolute, we do have the following examples:

Ia)          约翰是纽约人。

Yuehan shi           Niuyue                   ren
John       be            New-York              person
John is a New Yorker.

Ib)           约翰纽约人。

Yuehan  Niuyue                   ren.
John       New-York              person
John is a New Yorker.

IIa)         今天是星期天。

jintian    shi           xingqi-tian.
today     be            Sun-day
Today is Sunday.

IIb)         今天星期天。

jintian    xingqi-tian.
today     Sun-day
Today is Sunday.

It seems that the subject NP stands for some individual element(s), and the predicate NP describes a set (property) to which the subject belongs. But it is not clear how to capture Ib) and IIb) while excluding 7b). We leave this question open.

[2] We realize that the syntactic constraint defined here is only a rough syntactic approximation to the data. It seems to match most of the data, but there are exceptions when yi (one) appears in a numeral-classifier phrase:

IIIa)  他一副好身体。

ta            yi             fu            hao         shenti.
he            one         (cl.)         good       body
He is of good health. (He is of a good body.)

IIIb) * 他三副好身体。

ta            san          fu            hao         shenti
he            three       (cl.)         good       body

IIIc)   他好身体。

ta            hao         shenti.    [same as 1) ]

IVa) 李四一张圆圆的脸。

Lisi          yi             zhang     yuanyuan             de            lian.
Lisi          one         (cl.)         round-round         DE          face
Lisi has a quite round face.

IVb) * 李四两张圆圆的脸。

Lisi          liang       zhang     yuanyuan             de            lian.
Lisi          two         (cl.)         round-round         DE          face

IVc)  李四圆圆的脸。

Lisi          yuanyuan             de            lian.        [ same as 3) ]

[3] Another reading for 22a) is [S [S ta xuexi][AP hao]], where ta xuexi is a subject clause: "That he studies is good". This is another issue.

 


Notes for An HPSG-style Chinese Reversible Grammar

ABSTRACT

Key words: Chinese parsing, Chinese generation, reversible grammar,  HPSG

This paper presents a reversible Chinese unification grammar named CPSG. The lexicalized and integrated design of CPSG embodies the general spirit of the modern linguistic theory Head-driven Phrase Structure Grammar (HPSG, Pollard & Sag 1987, 1994). Using the ALE formalism in Prolog (Carpenter & Penn 1994), we have implemented a prototype of CPSG.

CPSG covers Chinese morphology, syntax and semantics in a novel integrated language model (Figure 1; for the interface between morphology and syntax, see Li 1997; for the interface between syntax and semantics, see Li 1996). The CPSG model is in sharp contrast to the conventional design, in which clear-cut grammar components apply in succession (Figure 2; see the survey in Feng 1996). We will show that our model is much better suited to, and more efficient for, Chinese analysis (or generation).

 

[Figure: cpsg (image not reproduced)]

Grammar reversibility is a highly desirable feature for multilingual machine translation applications (Hutchins & Somers 1992, Huang 1986, 1987). To test its reversibility, we have applied the CPSG prototype to an experiment in bi-directional machine translation between English and Chinese. The machine translation engine developed in our Natural Language Lab is based on the shake-and-bake design, a novel approach to machine translation suited to unification grammars (Whitelock 1992, 1994, Beaven 1992, Brew 1992). The experimental results meet our design objective and verify the feasibility of the CPSG approach.

~~~~~~~~~~~~~~~~~~~~~

Notes for NWLC-97, UBC, Vancouver

Outline of An HPSG-style Chinese Reversible Grammar

Wei LI

Linguistics Department, Simon Fraser University

 

Key words: lexicalist approach, integrated language model, HPSG, reversible grammar, bi-directional machine translation, Chinese computational grammar, Chinese word identification, Chinese parsing, Chinese generation

 

  1. background

1.1. design philosophy

Two major obstacles in writing a Chinese computational grammar:

(1) lack of serious study of the Chinese lexical base

    a well-designed lexicon is crucial for a successful computational system

    theoretical linguists have made fruitful efforts (e.g. Li Linding), but these lack formalization

    computational linguists need more patience in adapting and formalizing these results: it is a huge amount of work, but it has to be done if a non-toy system is targeted

(2) lack of effective interaction between morphology, syntax and semantics, e.g.

    ambiguity in word identification makes it hard to interface morphology and syntax: a theoretical defect of a morphology preprocessor (segmenter)

    e.g. ABC: ABC or A | BC or AB | C or A | B | C?

    active/passive isomorphic phenomena make semantic constraints necessary in parsing NP Vt: subject NP or object NP?

Solution: the lexicalized and integrated design of Chinese grammar
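The segmentation ambiguity of a string ABC can be made concrete with a short Python sketch. The function name `segmentations` and the toy lexicon are assumptions for illustration, not part of CPSG. A segmenter run as a preprocessor must commit to one of these splits before syntax is available.

```python
def segmentations(text, lexicon):
    """Return every way to split `text` into a sequence of words found in `lexicon`."""
    if not text:
        return [[]]          # exactly one way to segment the empty string
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in lexicon:
            for rest in segmentations(text[end:], lexicon):
                results.append([word] + rest)
    return results

# Hypothetical lexicon in which every substring of "ABC" happens to be a word.
TOY_LEXICON = {"A", "B", "C", "AB", "BC", "ABC"}

all_splits = segmentations("ABC", TOY_LEXICON)
for split in all_splits:
    print(" | ".join(split))
```

With this lexicon the sketch enumerates the four candidate analyses listed above, which is why an integrated design lets the parser, rather than a separate segmenter, choose among them.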

1.2. major theoretical foundation:

HPSG:       lexicalist theory encouraging integration of different components

a desired framework matching our design philosophy

CPSG: HPSG-style unification grammar

CPSG: reversible grammar suited for both parsing and generation

CPSG: formalized grammar, a description that does not rely on undefined notions

  2. integrated language model

2.1. CPSG versus conventional Chinese grammar

 

 

parse tree embodies both morphological and syntactic structures in CPSG

  3. lexicalized formal grammar

3.1. formalized grammar, as required by a computational grammar: formulation of CPSG

readily implementable (theories, principles, rules, etc.);

precise definition for the very basic notions (e.g. sign, morpheme, word, phrase, sentence, NP, VP, etc.), rules (PS rules and lexical rules), lexical items (lexical hierarchy), typology (hierarchy embodied in feature structures)

(4.)       Definition: sign

A sign is the most fundamental concept of grammar. Formally, a sign is defined by the type [a_sign], which introduces a set of linguistic features for its description, as shown below.

a_sign
    INDEX       index
    KANJI       kanji
    MORPH1      expected
    MORPH2      expected
    CATEGORY    category
    COMP0       expected
    COMP1       expected
    COMP2       expected
    MOD         expected
    KNOWLEDGE   knowledge
    CONTENT     content
    INDEX0      index
    INDEX1      index
    INDEX2      index
    DTR         dtr

(5.)       Definition: word

In CPSG, a word is a sign satisfying the following two conditions: (1) its obligatory morphological expectations have all been saturated; (2) it is not the mother of any syntactic structure, and hence has no syntactic daughters. Formally, a word is defined as shown below.

(6.)       word

a_sign
    MORPH1      ~obligatory
    MORPH2      ~obligatory
    DTR         no_syn_dtr
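The definition of word can be paraphrased as a predicate over signs. The following Python sketch is only an approximation of the typed feature structure above: the class `Sign`, its string values such as "saturated" and "comp0", and the function `is_word` are assumptions for illustration, not the ALE type declarations.

```python
from dataclasses import dataclass

@dataclass
class Sign:
    kanji: str
    morph1: str = "saturated"   # assumed values: "obligatory", "optional", "saturated"
    morph2: str = "saturated"
    dtr: str = "no_syn_dtr"     # "no_syn_dtr", or the name of a syntactic daughter structure

def is_word(sign):
    """A word has no unsaturated obligatory morphological expectation and no syntactic daughters."""
    return (sign.morph1 != "obligatory"
            and sign.morph2 != "obligatory"
            and sign.dtr == "no_syn_dtr")

stem_needing_affix = Sign("A", morph1="obligatory")   # still expects a morpheme
complete_word = Sign("AB")                            # expectations saturated, no daughters
phrase = Sign("AB C", dtr="comp0")                    # built by a syntactic rule

print(is_word(stem_needing_affix), is_word(complete_word), is_word(phrase))
```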

3.2. lexicalized grammar

CPSG consists of two parts:

(1) a minimized general grammar:

only 11 phrase structure rules
(covering complement structure, modifier structure,
conjunctive structure and morphological structure)

(2) a feature enriched lexicon:

lexical entries;
lexical hierarchy and a set of lexical rules
(capturing lexical generalizations).

 

(7.)          comp0 PS rule

MOTHER               a_sign
    COMP0       saturated
    COMP1       [1]
    COMP2       [2]
    DTR         comp0
    MYSISTER    [6]
    LEFTMOD     [7] category
    RIGHTMOD    [8] category
    LEFTCOMP    [9] category
    RIGHTCOMP   [10] category

===>

EXPECTING          a_sign
    COMP0       a_expected
        DIRECTION   left
        ROLE        [3]
        SIGN        [4]
    COMP1       [1] ~obligatory
    COMP2       [2] ~obligatory
    INDEX       [5]
    DTR         dtr
    LEFTMOD     [7]
    RIGHTMOD    [8]
    RIGHTCOMP   [10]

EXPECTED            a_sign [4]
    CONTENT     content
    MYHEAD      [5]
    MYROLE      [3] comp_role
    INDEX       [6]
    CATEGORY    [9]

PRINCIPLE            #head_feature

(8.)          lexical entry: chi

a_sign
    KANJI       one_character
        H1          chi
    CATEGORY    v
    INDEX0      [1] index
    INDEX1      [2] index
    COMP0       a_expected
        DIRECTION   left
        SIGN        a_sign
            CATEGORY    n
            INDEX       [1]
    COMP1       a_expected
        DIRECTION   right
        SIGN        a_sign
            CATEGORY    n
            INDEX       [2]
    KNOWLEDGE   eat
        U_OBJECT    food
            MALE        none
            PERSON      3
            SINGULAR    bin
        U_SUBJECT   animate
            MALE        bin
            PERSON      tri
            SINGULAR    bin
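To show how such an entry constrains its arguments, here is a minimal Python sketch. It reduces the entry for chi to two slots (an animate noun on the left, a food noun on the right) and checks candidate fillers against a toy type hierarchy. The hierarchy `ISA`, the dictionary `CHI` and the function names are assumptions for illustration; the real grammar uses unification over typed feature structures in ALE.

```python
# Hypothetical mini-ontology: child type -> parent type.
ISA = {"human": "animate", "dog": "animate", "rice": "food", "apple": "food", "stone": "thing"}

def subsumes(general, specific):
    """Return True if `specific` equals `general` or is one of its descendants in ISA."""
    while specific is not None:
        if specific == general:
            return True
        specific = ISA.get(specific)
    return False

# Simplified view of the lexical entry: COMP0 is the subject slot, COMP1 the object slot.
CHI = {
    "comp0": {"direction": "left",  "category": "n", "knowledge": "animate"},
    "comp1": {"direction": "right", "category": "n", "knowledge": "food"},
}

def fills(slot, category, semantic_type, side):
    """Check direction, syntactic category and semantic restriction of a candidate filler."""
    return (slot["direction"] == side
            and slot["category"] == category
            and subsumes(slot["knowledge"], semantic_type))

print(fills(CHI["comp0"], "n", "human", "left"))    # subject candidate
print(fills(CHI["comp1"], "n", "rice", "right"))    # object candidate
print(fills(CHI["comp1"], "n", "stone", "right"))   # violates the food restriction
```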

  4. Implementation and Application of CPSG

The CPSG prototype is implemented in ALE and Prolog and has parsed a corpus of 200 sentences of various types

ALE and Prolog: suitable for unification grammar
ALE:         mechanism for typed feature structures: type polymorphism
a powerful tool in language modeling

The CPSG prototype has been adapted for bi-directional MT and has generated the same corpus of 200 sentences

References

Beaven, John L. (1992): "Shake and Bake Machine Translation", Proceedings of the 15th International Conference on Computational Linguistics, pp. 603-609, Nantes, France.

Brew, Chris (1992): "Letting the Cat out of the Bag: Generation for Shake-and-bake MT", Proceedings of the 15th International Conference on Computational Linguistics, pp. 610-616, Nantes, France.

Carpenter, B. & Penn, G. (1994): ALE, The Attribute Logic Engine, User's Guide

Feng, Z.  (1996): "COLIPS Lecture Series - Chinese Natural Language Processing",  Communications of COLIPS, Vol.6, No.1 1996, Singapore (http://www.iscs.nus.sg/~colips/commcolips/paper/p96.html)

Huang, X-M. (1986): "A Bidirectional Grammar for Parsing and Generating Chinese".  Proceedings of the International Conference on Chinese Computing, Singapore, pp. 46-54

Huang, X-M. (1987): XTRA: The Design and Implementation of A Fully Automatic Machine Translation System, Doctoral dissertation, University of Essex.

Hutchins, W.J. & H.L. Somers (1992): An Introduction to Machine Translation. London, Academic Press.

Li, W. (1996): Interaction of Syntax and Semantics in Parsing Chinese Transitive Patterns. Proceedings of International Conference on Chinese Computing (ICCC'96), Singapore

Li, W. (1997): Chart Parsing Chinese Character Strings. Proceedings of The Ninth North American Conference on Chinese Linguistics (NACCL-9, to be available), Victoria, Canada

Pollard, C.  & I. Sag (1987): Information based Syntax and Semantics Vol. 1: Fundamentals. Centre for the Study of Language  and Information, Stanford University, CA

Pollard, C.  & I. Sag (1994): Head-Driven Phrase Structure Grammar,  Centre for the Study of Language and Information, Stanford University, CA

Whitelock, Pete (1992): "Shake and Bake Translation", Proceedings of the 14th International Conference on Computational Linguistics, pp. 784-790, Nantes, France.

Whitelock, Pete (1994). "Shake and Bake Translation", C.J. Rupp, M.A. Rosner, and R.L. Johnson (eds.), Constraints, Language and Computation, pp. 339-359, London, Academic Press.

 


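【One parsing a day: starting from the subcat of 见面 (jian-mian, "meet")】

Bai:
三两面 and 两三面 are very different...
我借过他三两面。(I borrowed three liang of flour from him.) 我见过他两三面。(I have met him two or three times.)

Me:
三两面 > 两三面
我见过他三两面 (I have met him a few times)

0912a
ditransitive, no problem, but:

0912b

separable verb jian-mian is still not connected

There are also:
(0)我见过他两三面。(I have met him two or three times.)
(1)我见过他。(I have met him.)
(2)我与他见过面。(I have met with him.)
(3)* 我见过面 (I have met.)
(4)我们见过面。(We have met.)
(5)我与他,见面过。(He and I have met.)

见面 requires that either the subject be plural (4), or the subject be a coordinate structure (5), or there be a prepositional phrase with 与 (with) (the boundary between PP and coordination is blurry in Chinese, (2)), or the quasi verbal-measure phrase 两三面 be preceded by a [human] modifier. All of these syntactic subcat requirements serve to fill one semantic (or common-sense) [human] slot: common sense says that 见面 (meeting) must take place between two or more human entities.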
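The common-sense constraint behind these subcat requirements can be sketched in Python as follows. The class `Clause`, its fields and the function `jianmian_ok` are assumptions made for illustration; they stand for what a parser would have to supply, namely how many human participants each syntactic position contributes.

```python
from dataclasses import dataclass

@dataclass
class Clause:
    subject_humans: int        # humans denoted by the subject (a plural or coordinated subject gives 2+)
    with_pp_humans: int = 0    # humans introduced by a 与 (with) PP
    object_humans: int = 0     # humans in object position, e.g. 他 in 我见过他两三面

def jianmian_ok(clause):
    """见面 needs at least two human participants, whichever syntactic slots supply them."""
    total = clause.subject_humans + clause.with_pp_humans + clause.object_humans
    return total >= 2

examples = {
    "(0) 我见过他两三面": Clause(1, object_humans=1),
    "(2) 我与他见过面": Clause(1, with_pp_humans=1),
    "(3) 我见过面": Clause(1),
    "(4) 我们见过面": Clause(2),
}
for text, clause in examples.items():
    print(text, jianmian_ok(clause))
```

The different syntactic patterns are the input conditions, and the single semantic requirement is what they all realize.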

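Lexicon-driven theories and linguistic representations such as HPSG, which rely heavily on subcat data structures, are cumbersome, but they have one bright spot: they describe the syntactic requirements above as matching conditions on the input, and the intrinsic semantic requirements (similar to the descriptions in HowNet) as the semantic output, formalizing them one by one in fine detail. The mechanism is the unification of labels (that is, the sharing of the substructures the labels stand for). By comparison, most systems are too crude in their treatment of the internal structure of subcat, the mapping from input to output, and the underlying relation between syntax and semantics (semantics is both the motivation and the goal of syntax: syntax matches, semantics is realized).

Too much is as bad as too little, and too little is as bad as too much. We have long been exploring how, in representing and implementing subcat, to be moderate without being mediocre, and simple without being crude.

Bai:
他我见过几面 (Him, I have met a few times)

Me:
A prime example of crudeness is the subcat codes in dictionaries meant for human readers, such as the Oxford advanced dictionary and the Longman dictionary: things like v1, ..., v23. Later, New York University organized CL graduate students to build subcat lexicons such as CompLex and NomLex. On the Chinese side, 《现代汉语800词》 (800 Words of Modern Chinese) from the Institute of Linguistics of the Chinese Academy of Social Sciences pioneered subcat description, and the 《动词用法词典》 (Dictionary of Verb Usage) series of dictionaries began trying to express subcat with some kind of coding plus example sentences. In terms of data representation and relations, all of this work looks rather crude. The root cause is that syntax and semantics were not clearly separated.

An NLP practitioner who takes up these resources has to digest them internally, connecting syntax with semantics, then settle on a data structure and find an implementation path. The implementation rarely achieves the elegance of a unification grammar; mostly it is a makeshift, so as to avoid the inefficiency and hard-to-maintain data structures that come with implementations like HPSG.

Professor Dong's HowNet is unsurpassed on the semantic side of subcat for Chinese and English, but its syntactic side still seems not detailed or complete enough. For example, the six or seven syntactic requirements of 见面 above do not seem to be described one by one (Professor Dong, please correct me: maybe I have not fully digested it), and I have not seen anyone else describe them clearly either. They all need to be chewed over and digested again, and then implemented.

0912c

(3) is ill-formed for generation (*), but for parsing, robustness requires that it be parsed like this, which is fine.

0912d

It came out without any debugging; call it the dumb luck of 9/12. (9/11 was the terrorist attack and 9/13 was Lin Biao's flight; neither is a good day.) Only the case 我见过他两三面 remains. This verbal-measure-complement-like thing is actually limited to a handful of forms: 一面, 几面, 两三面, 三两面, and so on. At the very least, 100+ 面 is basically impossible, unless they are lovers.

Zhang: In serious admiration.

Me:
Professor Zhang flatters me. Idle talk ruins the country; I am content as long as I do not lead "my own" students astray. I have never been a professor in my life, so any students I mislead are other people's, ha.

Zhang: 白求恩 (Norman Bethune)

Me:
Seriously, though, I may really be leading students astray, because everything has its larger environment and background. What I say is somewhat unorthodox, so mainstream students see it as flowers in the fog. Seeing flowers in the fog at least broadens one's view; what misleads most is seeing the flower but not being able to reach it. It is like what Lu Xun said: people were sleeping soundly in a dark room, and you insist on going in to shout a "Call to Arms" (《呐喊》) and wake them up, but the room is still dark; that is more than just cruel. The way that is not cruel is to wait until I retire and open a Deep Parsing open-source park, with every line of code, every lexical entry and every rule made public, and then see whether the crowd can build an unbeatable system. Let us all play with symbolic logic, so that the two approaches live on forever.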

 

 


 

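【Linguistic vignette: the "sales talk" of Apple's iPhone 7 launch】

Me:
A while ago I mentioned the challenge that the terse Chinese if-then construction poses for parsing. Yesterday I ran into more examples, again with very few overt formal markers, yet people understand them as if-then: "中国出生,美国长大,如何申请回国定居?" (Born in China, raised in the US, how to apply to return and settle in China?)

VP1, VP2, how VP3

中国出生,美国长大,如何申请回国定居?
== 【如果】【一个人】【在】中国出生,【并且】【在】美国长大,【那么】【他/她】【将】如何申请【他/她】回国定居【的paperwork】【呢】?
== [If] [a person] is born [in] China, [and] grows up [in] the US, [then] how [will] [he/she] apply for [the paperwork for] [his/her] return to settle in China?

So much has been left out. Terse Chinese!

This construction sounds perfectly natural, and ordinary people do not feel any problem of understanding or ellipsis. On closer inspection, it is not entirely without formal markers; a pattern like this seems to call for exactly this reading (?):

VP1(, VP2, ...), how VPn?

Once it matches, is any other semantic reading possible? VP1 through VPn-1 are all AND-ed conditions, and only VPn is the consequence of the hypothetical condition.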
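A rough Python sketch of this pattern is given below. It works on the surface string only, assuming that clauses are separated by commas and that the final clause starts with 如何 (how); a real system would match on parsed VPs instead. The function name `parse_if_then` and the output keys are assumptions for illustration.

```python
import re

def parse_if_then(sentence):
    """Match 'VP1(, VP2, ...), how VPn?' and return AND-ed conditions plus the consequent."""
    body = sentence.rstrip("??。")                       # drop final punctuation
    clauses = [c.strip() for c in re.split(r"[,,]", body) if c.strip()]
    if len(clauses) < 2 or not clauses[-1].startswith("如何"):
        return None                                      # pattern does not apply
    return {"conditions": clauses[:-1], "consequent": clauses[-1]}

result = parse_if_then("中国出生,美国长大,如何申请回国定居?")
print(result)
print(parse_if_then("不甜不要钱"))
```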

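Bai:
不甜不要钱 (not sweet, no charge) and 不甜的不要钱 (the ones that are not sweet are free)
They mean the same thing; does the form really have to be pulled that far apart?
Read as eliding 的, it is a simple sentence. Read as eliding 如果……则…… (if ... then ...), it is a complex sentence.

Me:
The 的-construction is a strange creature halfway between a phrase and a clause; so is the English what-clause.

Bai:
By the "lazy person's law", the reading with the simplest process and the simplest result takes priority in any case.
Whichever completion costs the least energy wins.

Song:
They are not exactly the same. When a melon seller points at a pile of melons and says 不甜不要钱, he means "I guarantee every one is sweet". 不甜的不要钱 is softer in tone: "I do not guarantee that every melon is sweet, but if the one you buy is not sweet, I will not charge you."

Bai:
Your example can only distinguish whether the elided noun gets an existential or a universal quantifier; it cannot distinguish whether the elided function word is 的 or 如果……则.

Me:
@Song Nice distinction. But this hardness or softness of tone is really subtle, and advertisers often seem to exploit such nuances deliberately to fool the public. For the same slogan, the soft reading is the one with real substance, while the advertiser wants the audience to take the hard reading, to show off confidence. 不甜不要钱 is exactly this kind of sales talk. Its practical and legal meaning is equivalent to 不甜的不要钱. But what it wants to convey is: my product is so good that it simply cannot fail to be sweet; if you do not believe it, I am willing to bet with you.

Bai:
Soft or hard, when you really do get one that is not sweet (a logical counterexample), it is certainly that particular melon that is free, not the whole pile. Try it if you do not believe me.

Me:
No need to try, @白硕

Speaking of sales talk, I only really appreciated it while watching the Apple event yesterday. From the Jobs era until now, the sales talk Apple most often uses on its faithful and on the general public is: iXyz is the best Xyz ever made by Apple
A statement like this is a universal truth carrying no information at all, yet it sounds like the most powerful advertising line.

Bai:
Having sentiment is enough.

Me:
Damn it, when you make electronics, do they not get better and better? Would they get worse and worse? Is it not a matter of course that a new generation is better than the previous ones? Is that not all that "best" claims here? It works every time. They treat the whole world as fools, and the whole world is happy to play the fool. Nobody questions or mocks it. If I were Apple's competitor, I would make a promotional video specifically to mock this sales talk.

Bai:
"made" is not rigidly bound to the scope of comparison.
It is not a hard constraint.

Me:
The comparison is iPhone7 with iPhones; iWatch Series 2 with iWatch 1

Bai:
It can also be understood as a horizontal comparison.

Me:
This is the official press release:
San Francisco - Apple today introduced iPhone 7 and iPhone 7 Plus, the best, most advanced iPhone ever, packed with unique innovations that improve all the ways iPhone is used every day.

"the best, most advanced iPhone ever"

Bai:
Back to the restrictive versus non-restrictive question again, clever Ikkyu.

Me:
Logically, strip the modifiers and it is just: iPhone 7 is iPhone

Bai:
That kind of stripping is as unacceptable as the one in 媳妇是娘 (the wife is the mother).

Me:
Even if Apple were completely rotten, with no innovation at all, it could always make this claim:
iPhone 7 is the best iPhone.
(iPhone 8 will be the best iPhone)
In fact, a new iPhone release is always the best iPhone.

Bai:
The thing is, users holding an Apple product in their hands will, on the other reading, feel that they can look down on the whole world.

Song:
The peak of Marxism-Leninism.
A new peak.

Me:
If he were really that good, he should say iPhone 7 is the best smart phone.
But he does not dare.

Bai:
Apple is not stupid; it just cannot fool Brother Wei (伟哥).

Me:
Only Google's SyntaxNet is naive enough to dare to claim to be number one in the world like this without stating any scope.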

 

 


 

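《Liwei's popular science: the tasks of the semantic module in an NLP system》

This piece discusses the tasks of the semantic module in NLP (Natural Language Processing), especially in knowledge graph applications. Before that, let us take a view from ten thousand meters up and see where the semantic module sits in the architecture of the main modules of linguistics and NLP.
Linguistics textbooks usually divide the study of language text, from shallow to deep, into these branches: morphology, syntax, semantics and pragmatics. There is another dimension, discourse study, which works across sentences, whereas the other branches generally stay within the sentence. In NLP, the results of morphological and syntactic research take the form of a parser, which automatically analyzes a linear string into a syntactic tree structure; endlessly varied sentences are thereby reduced to a finite set of sentence types or patterns, providing a solid foundation for language understanding and applications. Semantics sits after syntax and before pragmatics. We call it semantic middleware, because it is the end point of domain-independent language study and supports pragmatics, which depends on domain and application. The task of this semantic middleware could also be left to the pragmatic stage and done together at semantic grounding time, according to what pragmatics requires of semantics. In theory, however, there is always some semantic work that is domain-independent enough to be worth doing in advance, to support all kinds of pragmatic scenarios and applications and to lighten the load on the pragmatic module.
The semantic module (semantic middleware) so defined mainly looks for hidden links, such as implicit logical subjects and objects. These are not overtly marked at the syntactic stage, but there is enough evidence to determine how to fill them in. Filling them in draws on syntax (the overt links) and on ontology, usually a combination of the two. Done in a word-driven way, this is a very tractable task, more tedious than parsing but less difficult, because structure is available and ontology is available (including dynamically formed ontology nodes, such as the classification of NE proper names); the conditions are far more mature than those of the pure syntactic analysis module, which has only linear patterns to work with. Its usefulness is still not very clear. One argument is: if a hidden piece of semantics is important, why do people not use overt syntactic means? Even if an important piece of semantics is hard to express overtly within the syntactic structure chosen for one sentence, if it is important enough, people will switch to another syntactic structure and express it overtly in another sentence. If this argument has some merit, then not doing hidden semantics should not hurt big data mining very much. At least in a scenario like big data mining, the redundancy of information is enough to make up for the incompleteness of individual hidden semantics. When syntax finishes, some arg(s) mentioned in a sentence are not yet in place; the sentence can be said to be unsaturated. The task of the semantic middleware is to fill the unsaturated slots that syntax left incomplete: once the hidden links are built, the slots are saturated. If a slot is still unsaturated after the syntactic and semantic modules, it should be looked for in the discourse. If it is still not found in the discourse, then in theory it should be saturated through common sense.
Back to the view from ten thousand meters: yesterday I was still wondering what so-called "semantic computing" actually includes. Looking at the community, the related areas are: (1) WSD (Word Sense Disambiguation); (2) FrameNet (role labeling); (3) IE (Information Extraction). "Classic" IE (the MUC IE tradition) is generally divided into NE, relationship and event, plus Coreference and other tasks. From the viewpoint of a structure graph, NE and WSD do semantic computing on nodes; FrameNet and IE Templates (for relationship or event) do semantic computing on arcs (links). Looking at the tasks and directions defined by the community in this light, one finds that (1) and (2) are academic tasks and not practical. (3) is the most down-to-earth and is what applications (apps) directly need. But IE is domain-specific and directly serves products, so it is hard to abstract; one can then ask what, after syntax and before pragmatics, helps IE most. One such thing is Coreference. This task has already been absorbed by IE, but it is actually domain-independent semantic computing at the discourse level, there to support cross-sentence integration in IE.
Along this line we can be more specific. Based on actual needs, we have defined three tasks that we think should be done in the semantic middleware and should benefit all applications: the first is the appositive relation, which can be seen as a kind of Coreference; the second is the part-whole relation (for example, Apple and iPhone); the third is the cause-effect relation. These three relations are not limited to short syntactic distances; they also include long-distance and even cross-sentence links of this kind. We have been working on these three relations, plus pronoun coreference (including aliasing of proper names), more than on hidden logical subjects, predicates and objects, because the former directly serve the IE that follows local IE, so that results can be integrated into a graph: they are the glue of integration, whereas the latter can mostly be compensated for through information redundancy.
The above is what we learned by feeling our way in practice; it is simply how things naturally went. When local IE extracts information to fill the slots of an IE Template, all it sees is local information, and the material that fills the slots is often very "empty". The extreme examples of emptiness are pronouns (它 "it", 这个 "this") or referential nouns (这台电脑 "this computer"); these can only serve as bridges and cannot by themselves lead to a graph. This is where the work of the semantic module in the four areas above helps turn the empty material into something concrete, which is a very important support on the way to the graph.
Broadly speaking, how far semantic middleware should go leaves a lot of room for debate. Before the application is fixed, extending much of the fine-grained semantics further is not very meaningful, or is much work for little gain. That is, once syntax has put up the structural frame, and before the specific application at the pragmatic level is fixed, it is not easy to say how much semantic computing should be done; intuition and experience argue against doing too much. In a sense, Fillmore created FrameNet precisely in order to carry semantic middleware through to the end. In theory, his deepening makes sense, because after arg structure (the specialty of syntactic subcat), if one is to go deeper, a domain-independent Frame hierarchy is the deep bridge to pragmatics. At least in theory. But after 18 years of doing IE, our conclusion is that Fillmore's path of semantic computing is basically a wrong turn. We felt no benefit from it, yet it brought a lot of overhead, was hard to operate, and did not save work. The IE field uses Templates to define the needs of a pragmatic domain, and nobody advocates defining these Templates on the FrameNet hierarchy, because no need is felt and it is not realistic either. A hundred years from now, FrameNet may be rediscovered, because by then there will be so many pragmatic groundings that they will need organizing. FrameNet happens to provide a framework for organization and integration; today's pragmatic groundings are all scattered.
At Liwei-brand NLP University, students who can follow the "advanced popular science" above, mixed as it is with pseudo-foreign talk (jargon), may be awarded a degree. This degree is hard currency. If you cannot follow it, no matter: treat it as a madman's ravings, or as having strayed into a maze where even one's own trade feels a mountain away, wasting precious time you could have spent playing with deep learning (dl).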

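【Semantic Computing Salon: the chemistry of triangular relations】

Bai:
朴泰恒小组成绩不好,今天不一定能进决赛 (Park Tae-hwan's group-stage result was poor; he may not make the final today)
In the example above, where to attach 小组 (group) is a test.
The intended meaning is "in the group-stage heats".

Liang:
朴泰恒今天小组成绩不好。(Park Tae-hwan's group-stage result today was poor.)
孙杨小组第一。(Sun Yang was first in his group.)

Bai:
Groups named after a person also exist.

Liang:
Right. It feels as if 小组成绩不好 is the predicate. Here 小组 is not "Park Tae-hwan's group" either; here comes the test.

Me:
Aren't we talking about big data? See whether 某某某小组 (so-and-so's group) qualifies.

t08061

t08062

t08063

t08064

t08065

Liang:
@wei Great! There is a Topic.

Song:
@wei Really good. But can it actually distinguish the two kinds of 小组, or does it only cover one side?

Me:
Without big data, it probably covers only one side. We could try typical cases on the other side.

Song:
Even with big data, you still have to distinguish era, region, industry and so on; not easy.
Besides, this becomes supervised learning, which requires corpus annotation.

Bai:
Not necessarily, Professor Song. Labels can be added to the lexicon offline; for the target text one only needs to compute label density online. No supervised learning is involved.

Song:
Could you explain in more detail?

Me:
Lexicon acquisition is at bottom unsupervised, built on ngram frequencies. Suppose 北京大学 (Peking University) were not in the lexicon; it should be learnable, and so should 某某某小组. What Professor Bai is talking about is online lexicalization, by computing on the spot.

Song:
@wei For this example, that means comparing the frequencies of 朴泰恒小组 and 朴泰恒……小组, right?

Me:
Can this problem be solved: 北京大学、中学、小学要立刻全部动员起来 (Beijing's universities, middle schools and primary schools must all be mobilized at once)
The general rule for overlapping segmentation of xyz: is xy stronger or yz stronger? In principle this can be computed by online lookup.
Is 北京大学 (Peking University) stronger, or 大学、中学 (universities, middle schools)?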
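The xy-versus-yz rule can be sketched in Python as a comparison of corpus counts. The function name `resolve_overlap` and the counts in the example are hypothetical; a real system would obtain them from a lexicon or an online lookup over a suitable corpus.

```python
def resolve_overlap(x, y, z, counts):
    """Choose between [xy][z] and [x][yz] for an overlapping string xyz by comparing counts."""
    left = counts.get(x + y, 0)     # strength of the xy grouping
    right = counts.get(y + z, 0)    # strength of the yz grouping
    if left > right:
        return [x + y, z]
    if right > left:
        return [x, y + z]
    return None                     # tie: leave the decision to another knowledge source

# Hypothetical counts, for illustration only.
TOY_COUNTS = {"小组成绩": 4, "朴泰恒小组": 0}

print(resolve_overlap("朴泰恒", "小组", "成绩", TOY_COUNTS))
print(resolve_overlap("张三", "小组", "第一", {}))
```

As the discussion goes on to argue, raw frequency alone is a weak basis for such a decision, so the sketch deliberately returns no answer on a tie.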

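Song:
If we treat it as an overlapping ambiguity, then in big data the frequency of 小组成绩 is surely higher than that of 朴泰恒, unless Park Tae-hwan is extremely popular. So deciding the syntactic structure on this basis seems insufficiently justified.

Me:
How do humans decide?
This may involve the question of the scope of the big data.
Bigger data is not always better, and above all it must not be a mixed bag: big and mixed flattens out the domain, and this is quite possibly domain knowledge.

Song:
Right, I was confused.

Bai:
In fact, attaching to the person name is the fallback; all that needs to be learned is the high-frequency strings that do not attach to a person name.
If the conditions for attaching rightward are not met, just default to the left.
That is not how big data should be used.

Song:
Either way, in general X小组 cannot compete with 小组成绩. This is a matter of domain knowledge and is not easily handled by word frequency.

Me:
Let me first mention a discourse phenomenon: one sense per discourse.
If 某某某小组 occurs again in the same text, that principle is solid and the matter can be settled within the discourse; big data concedes at that point.

Song:
张三小组第一,李四小组第二。(Zhang San first in the group, Li Si second in the group.)

Bai:
@宋柔 This one is ambiguous.

Me:
There are four levels:
Level 1 is being "kidnapped" by the lexicon; 北京大学 is basically such a case
Level 2 is the discourse principle
Level 3 is domain data
Level 4, only then, is big data across domains
Where proper names and terminology are involved, one never gets as far as cross-domain big data; big data flattens domain knowledge, which does more harm than good.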
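The four levels can be read as a back-off cascade. The Python sketch below is one possible reading, not a description of any actual system: the function name `decide`, the argument names and the rule that a level decides only when it has a unique non-zero winner are all assumptions.

```python
def decide(candidates, lexicon, discourse_counts, domain_counts, general_counts):
    """Back off through four knowledge sources, most specific first."""
    # Level 1: a lexicon entry wins outright.
    for cand in candidates:
        if cand in lexicon:
            return cand, "lexicon"
    # Levels 2 to 4: the first source with a unique non-zero winner decides.
    for name, counts in (("discourse", discourse_counts),
                         ("domain", domain_counts),
                         ("general", general_counts)):
        best = max(candidates, key=lambda c: counts.get(c, 0))
        top = counts.get(best, 0)
        winners = [c for c in candidates if counts.get(c, 0) == top]
        if top > 0 and len(winners) == 1:
            return best, name
    return None, "undecided"

# Hypothetical data, for illustration only.
print(decide(["北京大学", "大学、中学"], {"北京大学"}, {}, {}, {}))
print(decide(["朴泰恒小组", "小组成绩"], set(), {"朴泰恒小组": 2}, {"小组成绩": 5}, {}))
print(decide(["朴泰恒小组", "小组成绩"], set(), {}, {"小组成绩": 5}, {}))
```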

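Bai:
That holds at the token level, but not necessarily at the feature level.
At the feature level, xx小组 can be pooled and counted together.

Me:
Understood. But in actual operation it is still a muddle. When xxx小组 fights 小组成绩, by how much must one win to count as winning? In how large a data set? If the gap is huge, fine; if they are at all close, it is a mess of an account, or a mess of a battle.

Bai:
In addition, feature density can be computed for a discourse; if one feature's density is markedly higher than the others', it can be used too. For example, if sports features are prominent, 小组 as a prefix gets higher priority.

Song:
I searched 11 years of People's Daily: 小组赛 occurs 1013 times, 小组成绩 4 times, 小组赛成绩 twice, person name + 小组 3 times. Someone with no knowledge of sports competitions at all, but with general knowledge of competitions, who knows that competitions yield results, can infer that 小组比赛 is a phrase. First, from the bound morpheme 赛 attaching to form 小组赛, one learns that there is a term 小组赛 and can understand it as competing in groups. Knowing that competitions yield results, one can then infer that 小组成绩 is a phrase, meaning someone's result in the group stage. Person name + 小组 occurs 7 times, but none is related to sports: 赵梦桃小组, 郝建秀小组 and so on are all from cotton mills. A person without knowledge of sports competitions, but with general knowledge of competitions plus knowledge of the language, can reason in this way.

Me:
"周恩来思想深刻 谈吐幽默" (Zhou Enlai is profound in thought and humorous in speech) vs. "毛泽东思想深刻" (Mao Zedong Thought is profound / Mao Zedong is profound in thought)
思想 (thought) is similar to 小组

Song:
Before the 1940s, Chinese does not seem to have had person name + 思想 as a word. After that, 毛泽东思想 (Mao Zedong Thought) became more and more frequent. But other person names + 思想 cannot form words.

Me:
The politics here is interesting: from then on, other person name + 思想 combinations became taboo: 我花开来百花杀 (when my flower blooms, all other flowers die).

Bai:
@Song 小组循环赛, 小组出线, 小组第一 ... all these combinations have 小组 as a prefix; if we look only at instances, they are really no better off than 朴泰恒小组. A bit more or less in raw frequency cannot serve as the basis for preferring one structure. But if we examine abstractly what influences the priority of the "prefix pattern" versus the "suffix pattern", we are bound to trace it back to features and to the density distribution of features in the discourse. If the "sports" or "competition" feature and its density dominate clearly, 小组 tends to be a prefix; otherwise it tends to be a suffix. If the instance carried by the prefix happens to be in the big data, fine; if not, support from "friendly forces" can be obtained indirectly through features and feature density. Likewise, if the "person name" or "task name" feature or its density is prominent, 小组 tends to be a suffix.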
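A minimal Python sketch of the feature-density idea follows. The label dictionary, the `margin` threshold and the function names are assumptions for illustration; the discussion above does not specify how density would be computed or compared.

```python
from collections import Counter

# Hypothetical offline lexicon labels: word -> feature.
LABELS = {"决赛": "sports", "游泳": "sports", "冠军": "sports",
          "棉纺厂": "industry", "车间": "industry"}

def feature_density(tokens):
    """Share of tokens in the discourse that carry each feature label."""
    if not tokens:
        return {}
    counts = Counter(LABELS[t] for t in tokens if t in LABELS)
    return {feature: n / len(tokens) for feature, n in counts.items()}

def xiaozu_attachment(tokens, margin=2.0):
    """Prefer 小组 as a prefix (小组成绩) when the sports feature clearly dominates."""
    density = feature_density(tokens)
    sports = density.get("sports", 0.0)
    others = max((d for f, d in density.items() if f != "sports"), default=0.0)
    if sports > 0 and sports >= margin * others:
        return "prefix"
    return "suffix"     # fallback: attach to the preceding name, as in 赵梦桃小组

print(xiaozu_attachment(["朴泰恒", "小组", "成绩", "不好", "今天", "决赛"]))
print(xiaozu_attachment(["赵梦桃", "小组", "棉纺厂", "车间"]))
```

The labels are assigned offline and only their density is computed on the target text, so no annotated training corpus is involved.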

 


 

 

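【One parsing a day: degraded text and robust parsing】

Me:
"i love programming the games are cool its fun to play them don't you think"
@Liang here are the parsing results for your casual English:

t0721a

So there is one error in parsing this "degraded text":
our parser links "the games" as Object of "programming", which is locally correct, an understandable mistake. But a human knows there is missing punctuation and will link "the games" as Subject of "are". Other aspects of the parse seem all right. So "degraded text" does pose some challenges, but a robust parser can still handle most of it.

@Liang:
Thank you, @wei. It is very well handled. By the way, it is not my casual English. I copied it from Khan Academy.
@wei, "Opred" means predicate as object; what is "infmod"?

Bai:
An infinitive used as a postmodifier.

Me:
Right. Opred is a predicative object, including -ing forms and infinitives.
In fact, with finer work that error can be corrected, because the force with which "are" demands a subject far exceeds the force with which the preceding verb demands an object. That would bring the result up to the level of human structural analysis.

Bai:
How did "think" end up as next? This is a tag question.

Me:
Professor Bai has a sharp eye; I would not have noticed had you not pointed it out. That is obviously a bug: the auxiliary was taken as the main verb.
For this specific case, that should be lexicalized.

Bai:
"are" is also close by, and it is unsaturated if its subject is not filled. "programming", by contrast, does not necessarily have a slot.
I agree with lexicalizing.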

 

 

 


 

 

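成语的弹性识别和理解机制 (An elastic recognition and understanding mechanism for idioms)

Bai:
"去年秋膘应犹在,只是猪颜改" (Last year's autumn fat should still be there; only the pig's face has changed)

Me:
1234应犹在 只是56改
The idiom elasticity mechanism catches these every time. Which parts of an idiom are variables and which are constants can be studied; people have a rough ledger of this in their heads. Take 九牛二虎之力 (the strength of nine oxen and two tigers, that is, tremendous effort) as an example. The first ring of elasticity is turning the numerals into variables: m牛n虎之力

二牛九虎之力
九虎二牛之力
八虎七牛之力
四牛五虎之力

None of these affects parsing or understanding; in every case it means that a tremendous amount of effort was spent.

The second ring of elasticity is turning the nouns into variables along the taxonomy: m [big animal] n [big animal] 之力

九熊二豹之力
三象五狮之力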
转:
今个立秋,问苍天什么季节最忙? 秋天,多事之秋; 什么季节最公平? 秋天,平分秋色; 什么季节最简单? 秋天, 一叶知秋; 什么季节最长? 秋天,一日不见如隔三秋; 什么季节最爽? 秋天,秋高气爽;什么季节最险?秋天,秋后算账: 什么季节最暧昧? 秋天,暗送秋波!秋日快乐!!

成语弹性机制 从 “秋” 上升到 【季节】 再上升可以到 【时段】:

多事之春 多事之年 多事之岁月
平分春色
一花知春
一日不见如隔三冬
一日不见如隔九冬

白:
秋天来了,冬天还会远吗

我:
冬天来了 秋天还会远吗
这是时间隧道
或倒转 或快进。

关于小标题:

0905b

【成语】的【【弹性【识别和理解】】机制】,论句法应该是这样的:对于成语,需要一个弹性的识别机制,或者弹性识别的机制。但写的时候,脑子里更可能想的是,对于【成语的弹性】,需要一个识别机制。

再一想,who cares,人的表达和理解不常常是这样模模糊糊的吗。除了段子或较真,通常人根本就对这类结构歧义无感。语义上的模糊也不影响理解的大面。


Once upon a time, we were publishing like crazy

List of 23 NLP Publications (Cymfony Period)

Once upon a time, we were publishing like crazy ...... as if we were striving for tenured faculty positions

[1] R. Srihari, W. Li and X. Li. 2006. Question Answering Supported by
Multiple Levels of Information Extraction. A book chapter in T. Strzalkowski & S. Harabagiu (eds.), Advances in Open-Domain Question Answering.  Springer, 2006, ISBN:1-4020-4744-4.

http://link.springer.com/chapter/10.1007%2F978-1-4020-4746-6_11

[2] R. Srihari, W. Li, C. Niu and T. Cornell. 2006.  InfoXtract: A Customizable Intermediate Level Information Extraction Engine.  Journal of Natural Language Engineering, 12(4), 1-37

http://journals.cambridge.org/action/displayAbstract?fromPage=online&aid=1513012

This paper focuses on IE tasks designed to support  information discovery applications. It defines new IE tasks such as entity profiles, and concept-based general events which represent realistic goals in terms of what can be accomplished in the near-term as well as providing useful, actionable information.

[3] C. Niu, W. Li, R. Srihari, H. Li.  2005. Word Independent Context Pair Classification Model For Word Sense Disambiguation.  Proceedings of Ninth Conference on Computational Natural Language Learning (CoNLL-2005)

W05-0605

[4] C. Niu, W. Li and R. Srihari. 2004. Weakly Supervised Learning for
Cross-document Person Name Disambiguation Supported by Information
Extraction. In Proceedings of ACL 2004.

ACL 2004 Niu Li Srihari 372_pdf_2-col

[5] C. Niu, W. Li, R. Srihari, H. Li and L. Christ. 2004. Context Clustering for Word Sense Disambiguation Based on Modeling Pairwise Context Similarities. In Proceedings of Senseval-3 Workshop.

ACL 2004 Context Clustering for WSD niu1

[6] C. Niu, W. Li, J. Ding, and R. Srihari. 2004. Orthographic Case
Restoration Using Supervised Learning Without Manual Annotation.
International Journal of Artificial Intelligence Tools, Vol. 13, No.
1, 2004.

IJAIT 2004 Niu, Li, Ding, and Srihari caseR

[7] C. Niu, W. Li and R. Srihari. 2004. A Bootstrapping
Approach to Information Extraction Domain Porting. ATEM-2004: The
AAAI-04 Workshop on Adaptive Text Extraction and Mining. San Jose.

[8] W. Li, X. Zhang, C. Niu, Y. Jiang, and R. Srihari. 2003. An Expert
Lexicon Approach to Identifying English Phrasal Verbs. In Proceedings
of ACL 2003. Sapporo, Japan. pp. 513-520.

ACL 2003 Li, Zhang, Niu, Jiang and Srihari 2003 PhrasalVerb_ACL2003_submitted

[9] C. Niu, W. Li, J. Ding, and R. Srihari 2003. A Bootstrapping
Approach to Named Entity Classification using Successive Learners. In
Proceedings of ACL 2003. Sapporo, Japan. pp. 335-342.

ACL 2003 Niu, Li, Ding and Srihari 2003 ne-acl2003

[10] W. Li, R. Srihari, C. Niu, and X. Li. 2003. Question Answering on
a Case Insensitive Corpus. In Proceedings of Workshop on Multilingual
Summarization and Question Answering - Machine Learning and Beyond
(ACL-2003 Workshop). Sapporo, Japan. pp. 84-93.

ACL 2003 Workshop Li, Srihari, Niu and Li 2003 QA-workshopl2003_final

[11] C. Niu, W. Li, J. Ding, and R.K. Srihari. 2003. Bootstrapping for
Named Entity Tagging using Concept-based Seeds. In Proceedings of
HLT/NAACL 2003. Companion Volume, pp. 73-75, Edmonton, Canada.

NAACL 2003 Niu, Li, Ding and Srihari 2003 ne_submitted

[12] R. Srihari, W. Li, C. Niu and T. Cornell. 2003. InfoXtract: A
Customizable Intermediate Level Information Extraction Engine. In
Proceedings of HLT/NAACL 2003 Workshop on Software Engineering and
Architecture of Language Technology Systems (SEALTS). pp. 52-59,
Edmonton, Canada.

NAACL 2003 Workshop InfoXtract SEALTS paper2

[13] H. Li, R. Srihari, C. Niu, and W. Li. 2003. InfoXtract Location
Normalization: A Hybrid Approach to Geographic References in
Information Extraction. In Proceedings of HLT/NAACL 2003 Workshop on
Analysis of Geographic References. Edmonton, Canada.

NAACL 2003 Workshop Li, Srihari, Niu and Li 2003 CymfonyLoc_final

[14] W. Li, R. Srihari, C. Niu, and X. Li 2003. Entity Profile
Extraction from Large Corpora. In Proceedings of Pacific Association
for Computational Linguistics 2003 (PACLING03). Halifax, Nova Scotia,
Canada.

PACLING 2003 Li, Srihari, Niu and Li 2003 Entity Profile profile_PACLING_final_submitted

[15] C. Niu, W. Li, R. Srihari, and L. Crist 2003. Bootstrapping a
Hidden Markov Model for Relationship Extraction Using Multi-level
Contexts. In Proceedings of Pacific Association for Computational
Linguistics 2003 (PACLING03). Halifax, Nova Scotia, Canada.

PACLING 2003 Niu, Li, Srihari and Crist 2003 CE Bootstrapping PACLING03_15_final

[16] C. Niu, Z. Zheng, R. Srihari, H. Li, and W. Li 2003. Unsupervised
Learning for Verb Sense Disambiguation Using Both Trigger Words and
Parsing Relations. In Proceedings of Pacific Association for
Computational Linguistics 2003 (PACLING03). Halifax, Nova Scotia,
Canada.

PACLING 2003 Niu, Zheng, Srihari, Li and Li 2003 Verb Sense Identification PACLING_14_final

[17] C. Niu, W. Li, J. Ding, and R.K. Srihari 2003. Orthographic Case
Restoration Using Supervised Learning Without Manual Annotation. In
Proceedings of the Sixteenth International FLAIRS Conference, St.
Augustine, FL, May 2003, pp. 402-406.

FLAIRS 2003 Niu, Li, Ding and Srihari 2003 FLAIRS03CNiu

[18] R. Srihari  and W. Li 2003. Rapid Domain Porting of an
Intermediate Level Information Extraction Engine. In Proceedings of
International Conference on Natural Language Processing 2003.

ICON2003 paper FINAL

[19] H. Li, R. Srihari, C. Niu and W. Li 2002. Location Normalization
for Information Extraction. In Proceedings of the 19th International
Conference on Computational Linguistics (COLING-2002). Taipei, Taiwan.

COLING 2002 Li, Srihari, Niu and Li 2002 coling2002LocNZ

[20] W. Li, R. Srihari, X. Li, M. Srikanth, X. Zhang and C. Niu 2002.
Extracting Exact Answers to Questions Based on Structural Links. In
Proceedings of Multilingual Summarization and Question Answering
(COLING-2002 Workshop). Taipei, Taiwan.

COLING 2002 Workshop Li et al CymfonyQA_final

[21] R. Srihari, and W. Li. 2000. A Question Answering System
Supported by Information Extraction. In Proceedings of ANLP 2000.
Seattle.

ANLP 2000 Srihari and Li 2000 anlp9l

[22] R. Srihari, C. Niu and W. Li. 2000. A Hybrid Approach for Named
Entity and Sub-Type Tagging. In Proceedings of ANLP 2000. Seattle.

ANLP 2000 Srihari, Niu and Li 2000 anlp105_final9

[23] R. Srihari and W. Li. 1999. Question Answering Supported by
Information Extraction. In Proceedings of TREC-8. Washington


Other publications: SBIR Final Reports

W. Li & R. Srihari. 2003.  Flexible Information Extraction Learning Algorithm (Phase 2), Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York. 


W. Li & R. Srihari. 2001.  Intermediate-Level Event Extraction for Temporal and Spatial Analysis and Visualization (Phase 1), Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York.

W. Li & R. Srihari.  2000.  A Domain Independent Event Extraction Toolkit (Phase 2), Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York.

W. Li & R. Srihari. 2000.  Flexible Information Extraction Learning Algorithm (Phase 1), Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York.

W. Li & R. Srihari 2003. Automated Verb Sense Identification (Phase I), Final Technical Report, U.S. DoD SBIR (Navy), Contract No. N00178-02-C-3073 (2002-2003)

R. Srihari & W. Li 2003. Fusion of Information from Diverse, Textual Media: A Case Restoration Approach (Phase I), Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York. Contract No. F30602-02-C-0156 (2002-2003)

R. Srihari, W. Li & C. Niu 2004. A Large Scale Knowledge Repository and Information Discovery Portal Derived from Information Extraction (Phase 1), Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York. (2003-2004)

R. Srihari & W. Li 2003. An Automated Domain Porting Toolkit for Information Extraction (Phase I), Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York. Contract No. F30602-02-C-0057 (2002-2003)

T. Cornell, R. Srihari & W. Li 2004. Automatically Time Stamping Events in Unrestricted Text (Phase I), Final Technical Report, Air Force Research Laboratory, Information Directorate, Rome Research Site, New York. (2003-2004)

 

[Related]

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP

【A parse a day: parsing as the nuclear weapon of question answering】

A parse a day: today's is ...

0831d

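How do we know that the question and the answer here can be paired? With parsing, and a knowledge graph built on top of it, this is easy. The graph contains a professionOf relationship, and once parsing is in place, extracting that relationship is a piece of cake (this example is very simple: it just maps the appositive relation onto professionOf). Parsing also lets us work out the asking point, that is, which relation the question is after: the subtree (S: 李娜-从事, O: 从事-运动; Mod: 什么-运动), with 李娜 "Li Na", 从事 "engage in", 运动 "sport" and 什么 "what", determines that the asking point is professionOf(“李娜”). Semantic matching then closes the loop of the QA system. This is IE or knowledge-graph supported QA.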

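Concretely, to let Q and A match, we can apply subtree rules on both sides that fill (extract into) the professionOf relation, so that the semantics are unified; after that everything goes smoothly. The first subtree rule (职业 "profession", 运动 "sport", 什么 "what", 何种 "what kind of") is:

“从事”

O: (“职业|运动”)

Mod: (“什么|何种”)

S: ^Somebody

==> professionOf(^Somebody,?)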

This is question parsing and asking-point extraction. On the answer-source side there is likewise a set of rules for extracting professionOf, including this one:

[person-NE]:^Person

equiv([profession_token]:^Profession)

==> professionOf(^Person,^Profession)

And that is how Q and A match.
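The two rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the parses are hand-written (relation, head, dependent) triples standing in for real parser output, the answer sentence with 网球运动员 ("tennis player") is invented for the example, and the names `question_rule`, `answer_rule` and `match` are hypothetical rather than part of any actual system.

```python
# Hypothetical parser output: (relation, head, dependent) triples.
QUESTION = [("S", "从事", "李娜"), ("O", "从事", "运动"), ("Mod", "运动", "什么")]
ANSWER = [("equiv", "李娜", "网球运动员")]  # apposition in an assumed answer sentence

PERSON_NE = {"李娜"}                 # stand-in for a named-entity tagger
PROFESSION_TOKENS = {"网球运动员"}    # stand-in for a profession lexicon


def question_rule(triples):
    """Question side: 从事 + O(职业|运动) + Mod(什么|何种) ==> professionOf(S, ?)."""
    deps = {(rel, head): dep for rel, head, dep in triples}
    subject = deps.get(("S", "从事"))
    obj = deps.get(("O", "从事"))
    if subject and obj in {"职业", "运动"} and deps.get(("Mod", obj)) in {"什么", "何种"}:
        return ("professionOf", subject, None)  # None marks the asking point
    return None


def answer_rule(triples):
    """Answer side: [person-NE] equiv [profession_token] ==> professionOf(Person, Profession)."""
    for rel, head, dep in triples:
        if rel == "equiv" and head in PERSON_NE and dep in PROFESSION_TOKENS:
            return ("professionOf", head, dep)
    return None


def match(asking_point, fact):
    """Fill the asking point when the relation name and first argument agree."""
    if asking_point and fact and asking_point[:2] == fact[:2]:
        return fact[2]
    return None


print(match(question_rule(QUESTION), answer_rule(ANSWER)))
```

Both sides are normalized to the same relation before anything is compared, so the surface wording of the question and the answer never has to overlap.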

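What if there is no dedicated knowledge graph and no extraction of predefined relations? How can QA cope then? SVO parsing alone can handle a good share of questions about events. But for relations and complex events, simple SVO matching is not enough. Fortunately, complex semantics can in principle mostly be predefined as IE targets and extracted with dedicated rules, while simple semantics is open-ended, and linguistic parsing (subject, predicate, object, attributive, adverbial, complement and so on) is enough to deal with it.

Heaven has not deceived us.

For SVO, IE is in essence (semantic) slot normalization. The original slots are linguistic: S, O, equiv (apposition), mod and so on. The new slots are pragmatic semantics, for example professionOf, locationOf, employeeOf, acquiringCompany, acquiredCompany, priceOfAcquisition, etc.

SVO-matching QA can also be illustrated with an example, such as asking how to do something, where "do + something" is simply a V+O: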

0831a

0831b

0831c

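However the wording changes, what stays constant is the VO (格式化 "format", 硬盘 "hard disk"). With this VO matching as a foundation, QA or human-machine dialogue is not far off. For example, an FAQ archive is likely to contain titles such as 格式化硬盘的步骤 ("Steps for formatting a hard disk") or 关于格式化硬盘 ("About formatting a hard disk"). Q and A are then matched essentially by SVO subtree matching: “格式化” ---O---> “硬盘”.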
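A minimal sketch of this VO matching, assuming the parser already delivers (relation, head, dependent) triples. The parses, the third FAQ title and the function names are made up for illustration only.

```python
def vo_pairs(triples):
    """Collect (verb, object) pairs from (relation, head, dependent) triples."""
    return {(head, dep) for rel, head, dep in triples if rel == "O"}


# Hypothetical parses of two phrasings of the same question.
Q1 = [("Adv", "格式化", "怎样"), ("O", "格式化", "硬盘")]
Q2 = [("O", "告诉", "我"), ("Adv", "格式化", "如何"), ("O", "格式化", "硬盘")]

# Hypothetical FAQ titles with their parses; the third title is an invented distractor.
FAQ = {
    "格式化硬盘的步骤": [("O", "格式化", "硬盘"), ("Mod", "步骤", "格式化")],
    "关于格式化硬盘": [("O", "格式化", "硬盘")],
    "安装打印机的步骤": [("O", "安装", "打印机"), ("Mod", "步骤", "安装")],
}


def matching_titles(question_triples, faq):
    """Return FAQ titles that share at least one VO pair with the question."""
    wanted = vo_pairs(question_triples)
    return [title for title, triples in faq.items() if wanted & vo_pairs(triples)]


print(matching_titles(Q1, FAQ))
print(matching_titles(Q2, FAQ))
```

Both phrasings retrieve the same two titles because only the shared VO pair is compared; the extra pair (告诉, 我) in the second question simply finds no partner.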
0901b

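To take this topic a bit further: IE stands for information extraction, and most of the time this information is equated with insights (intelligence, valuable information). But IE can extract valuable information, and it can equally extract worthless information (noise).

Why extract worthless information? The reason is simple: noise gets in the way, and to remove noise you first have to recognize it, that is, extract it in order to throw it away. The method can be exactly the same. Search has stop words, which are discarded as noise. They are the simplest form of noise: they need no context and are purely high-frequency function words. For parsing these stop words are actually crucial, the necessary bridges for building structure, but for keyword search, which has no structure, they become pure noise. Using IE to remove noise means deciding from the contextual structure which information should be discarded. In the sentences above, in the pragmatic setting of QA, we can strip out expressions such as “请告诉我” ("please tell me") and “我不知道” ("I don't know"), which makes the key VO “格式化-硬盘” stand out. For similarity computation all of these words are noise. Could “请告诉我” be treated as a 4-gram stop word? It could, but if such expressions come in many variants, n-grams no longer work. At that point, doing IE on top of subtrees to extract the noise is very attractive. And since most noise can be handled in a word-driven way, this is quite dependable: it hits the target nearly every time.

To sum up: in general, if Q and A are worded similarly, for example “格式化” + “硬盘”, then matching on the SVO basis is enough to couple Q and A. If the wordings differ greatly, or a relation or event has too many variants, add an IE layer and do the matching on the IE semantics. SVO-based QA matching is the essence of intelligent search and can handle unpredictable questions. IE-based QA matching is predefined and domain-specific; it is not only precise but can also cope with variants. The two approaches complement each other: one excels at precision within a domain, the other at open-domain breadth and recall. Both are much better than keywords because they have structure. In backoff terms, IE comes first, SVO second, and keywords are the safety net at the bottom. That way both precision and breadth are taken care of.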
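The backoff order (IE first, SVO second, keywords last) can be written as a small dispatcher. The three matchers below are toy stand-ins invented for this sketch; a real system would plug in its own IE, SVO and keyword components.

```python
def answer_with_backoff(question, layers):
    """Try each (name, matcher) in order and return the first non-None result."""
    for name, matcher in layers:
        result = matcher(question)
        if result is not None:
            return name, result
    return None, None


# Toy stand-ins for the three layers (assumptions, not real components).
def ie_match(question):
    known = {"李娜从事什么运动": "professionOf(李娜) = 网球运动员"}
    return known.get(question)


def svo_match(question):
    if "格式化" in question and "硬盘" in question:
        return "FAQ: 格式化硬盘的步骤"
    return None


def keyword_match(question):
    return "keyword hits for: " + question


LAYERS = [("IE", ie_match), ("SVO", svo_match), ("keywords", keyword_match)]

for q in ["李娜从事什么运动", "请告诉我怎样格式化硬盘", "天气"]:
    print(answer_with_backoff(q, LAYERS))
```

Putting the most precise layer first means a looser layer is consulted only when the stricter one has nothing to say, which is how precision and recall are both served.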

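In the end, for QA and for dialogue systems, parsing is the key technology of the core engine. QA boils down to establishing a mapping between Q and A, and the basis of that mapping is semantic matching. Deep parsing and the IE built on it are the nuclear weapon of semantic matching.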

 

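【Related】

【A vision for bots】

Liwei's popular science: question answering of the past and present

泥沙龙 notes: parsing is the engine's nuclear weapon, NLP and search revisited

泥沙龙 notes: from sparse data, parsing as the nuclear weapon of NLP applications revisited

【Liwei's popular science: the secret of NLP's nuclear weapon】

Question answering systems

泥沙龙 notes: on search and knowledge graphs

【Pinned: overview of Liwei's NLP blog posts】

Table of contents of 《朝华午拾》

Liwei's NLP channel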

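Liwei's master's thesis: EChA experiment results (11)

An experiment in automatic translation from Esperanto into Chinese and English
-- Overview of the EChA machine translation system

[References]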

 

  1. Heinz Dieter MAAS "Automata Tradukado en kaj el Esperanto" ( "Lingvo-kibernetiko kaj aliaj internacilingvaj aktoj de la IX-a Internacia Kongreso de Kibernetiko", pp 75-81, 1982 Gunter Narr Verlag Tübingen )
  2. <<机器翻译论文选辑>> ( 科学技术文献出版社, 1979 )
  3. Kalocsay-Waringhien <<Plena Analiza Gramatiko de Esperanto>> ( 中国世界语出版社, 1984 )
  4. 刘涌泉等著 <<中国的机器翻译>> ( 知识出版社, 1984 )
  5. 刘涌泉, 高祖舜, 刘倬著 <<机器翻译浅说>> ( 科学普及出版社, 1964 )
  6. 刘涌泉, 李维 <<巴贝尔通天塔必将建成>> ( 中国第一届世界语大会论文, 1985.8 )
  7. 刘倬 <<三次机器翻译试验>> ( 第一次机器翻译学术会议论文, 1980.9 )
     <<论机器翻译规则系统的编制方法>> ( 1982.3 上海 )
     <<JFY型英汉机器翻译系统的研制和试验>> ( 语言学会第二届年会论文, 1983.4 )
  8. 乔毅 <<开展语言的计算机处理和世界语类型的机器翻译>> ( 中国第一届世界语大会论文, 1985.8 )
  9. 魏原枢, 徐文琪编 <<世界语语法>> ( 上海外语教育出版社, 1982 )
  10. 叶蜚声, 徐通锵著 <<语言学纲要>> ( 北京大学出版社, 1981 )
  11. <<语言和计算机>> (1) (中国社会科学出版社, 1982 )
  12. <<语言和计算机>> (2) (中国社会科学出版社, 1985 )
  13. 张道真编著 <<实用英语语法>> ( 商务印书馆, 1984 )

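[Acknowledgments]

From the very beginning, the development of an Esperanto-type machine translation system received warm support from my teacher Liu Yongquan (刘涌泉), who gave careful guidance on everything from the overall design to the handling of specific problems. During programming and debugging on the machine, my teacher Liu Zhuo (刘倬) also gave guidance many times, and the algorithms for some basic operations were provided by him. Now that the EChA system has achieved initial results, the author expresses deep gratitude to them. Special thanks also go to Ms. Han (韩老师) of the computer room for her help in many ways. Without the convenience she provided, the EChA system could not possibly have been tested successfully in such a short time.

[Appendix 1] EChA experiment results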

 

(1) LA ORIGINALA TEKSTO / THE ORIGINAL TEXT / 世界语原文

(001) TIEL EVOLUIGHIS PLI KAJ PLI LA PLANADO PER MASHINOJ . (002) TIUJ MASHINOJ KOMENCE NUR ELKALKULIS LA DIKTITAJN MATEMATIKAJN PROBLEMOJN , KONFORME AL LA ENPROGRAMIGO . (003) LA ELEKTRONIKAN PROGRAMIGON PRETIGIS HOMOJ . (004) PLI POSTE , KIAM LA SCIODISKETOJ ESTIS ELTROVITAJ , LA PLENAN INDIKARON , ENDISKIGITAN , ONI METIS EN MASHINOJN KAJ ILI TIAMANIERE POVIS EN SI MEM AKUMULI SCIENCAN STOKON , PLI GRANDAN OL LA HOMA CERBO. (005) KAJ SE TEMIS EKZEMPLE PRI LA PLANADO DE ELEKTROMOTORO , ONI ENMETIS LA SHABLONDISKETON DE LA ELEKTROMOTOR-PLANADO , DONIS LA INDIKOJN DE LA DEZIRATA MOTORO ( KILOVATO , TENSIO , ROTACIO , TIPO , KTP ) , (006) POST KIO LA MASHINO MEM PROGRAMIGIS SIN KAJ FARIS LA KALKULOJN . POST KELKAJ MINUTOJ GHI JAM PRETE ELDONIS LA MEZUROJN : LA DIAMETRON DE LA ROTACIA PARTO , GHIAN LONGON, LA MEZUROJN DE LA KANELOJ , DRATOJ , LA VOLVONOMBRON , ENTUTE CHION BEZONATAN . (007) ECH PLI : BALDAU ESTIS ATINGITE , KE LA MASHINO FARIS LA TUTAN DESEGNON KAJ TRANSDONIS GHIN AL LA FABRIKO . (008) KOMPRENEBLE TIUJ < DESEGNOJ > NE ESTIS IDENTAJ KUN NIAJ PAPERDESEGNOJ . (009) ILI ESTIS DISKETOJ , KIUJ ENTENIS CHIUN DETALON . (010) TIAMANIERE LA PLANADON KAJ FABRIKADON DE LA MASHINOJ JAM PLENUMIS SAME MASHINOJ . (011) ILI PLANIS LA MENDITAN MASHINON , FABRIKIS , ECH KONTROLPROVIS GHIN KAJ LA FUSHAN FORJHETIS . (012) SED CHIO CHI ANKORAU OKAZIS SUB HOMA GVIDADO KAJ PLEJ GRAVE ESTIS , KE CHIO CHI BAZIGHIS SUR LA HOMA SCIO .

LA TEKSTO TRADUKITA EN LA ANGLAN / THE TEXT TRANSLATED INTO ENGLISH / 英语译文

(001) SO DEVELOPED MORE AND MORE THE PLANNING BY MACHINES . (002) THOSE MACHINES AT BEGINNING ONLY CALCULATED OUT THE DICTATED MATHEMATICAL PROBLEMS , ACCORDING TO THE PROGRAMMING . (003) MEN PREPARED THE ELECTRONIC PROGRAMMING . (004) MORE LATER , WHEN THE KNOWLEDGE-DISKETTES HAD BEEN FOUND OUT , PEOPLE PUT THE FULL INDICATION , ENDISKED , INTO MACHINES AND THEY THEREFORE COULD IN THEMSELVES ACCUMULATE SCIENTIFIC STOCK, MORE GREAT THAN THE MAN'SBRAIN . (005) AND IF IT CONCERNED FOR EXAMPLE ABOUT THE PLANNING OF ELECTRIC MOTOR, PEOPLE INPUT THE SAMPLE DISKETTE OF THE MOTOR PLANNING , GAVE THE INDICATIONS OF THE DESIRED MOTOR (KILOWATT , VOLTAGE , ROTATION , TYPE , ETC ) , AFTER WHICH THE MACHINE ITSELF PROGRAMMED ITSELF AND DID THE CALCULATIONS . (006) AFTER SEVERAL MINUTES IT ALREADY READILY GAVE OUT THE MEASUREMENTS : THE DIAMETER OF THE ROTARY PART ,ITS LENGTH , THE MEASUREMENTS OF THE GROOVES , WIRES , THE WINDING NUMBER , IN TOTAL ALL REQUIRED . (007) EVEN MORE : SOON IT HAD BEEN ACHIEVED , THAT THE MACHINE DID THE TOTAL DESIGN AND OVERHANDED IT TO THE FACTORY . (008) OF COURSE THOSE < DESIGNS >  WERE NOT IDENTICAL WITH OUR PAPERDESIGNS . (009) THEY WERE DISKETTES , WHICH CARRIED ALL DETAIL . (010) THEREFORE MACHINES ALREADY FULFILED THE PLANNING AND MANUFACTURING OF THE MACHINES SAMELY . (011) THEY PLANNED THE ORDERED MACHINE , MANUFACTURED , EVEN EXAMINED IT AND THREW AWAY THE USELESS . (012) BUT ALL THIS STILL HAPPENED UNDER MAN'S GUIDING AND IT WAS MOST IMPORTANT , THAT ALL THIS WAS BASED ON THE MAN'S KNOWLEDGE .

LA TEKSTO TRADUKITA EN LA CHINAN / THE TEXT TRANSLATED INTO CHINESE / 汉语译文

(001) 这样用机器设计越来越发展了. (002) 那些机器开始时仅仅按照输入程序计算出所命令的数学问题. (003) 人准备了电子程序设计. (004) 更以后,当微型知识磁盘被发明了时,人们把所写入磁盘的全套指令集合放到机器里面,他(它)们这样能在自己本身里面积累比人的头脑更大的科学贮蓄. (005) 如果涉及例如关于电动机的设计, 人们输入了电动机设计的微型样品磁盘, 给了所希望的电动机的指标(千瓦,电压,运转,型号,等等),在此以后机器本身把自己程序化了,做了计算. (006) 在几分钟以后它已经就能给出尺寸:运转部分的直径,它的长度,槽纹,导线的尺寸,圈数,总之所需要的一切. (007) 甚至更:很快达到了,机器做了整个图样,把它转交到工厂. (008) 当然那些<图样>与我们的图纸不是一样的. (009) 他(它)们是储有所有细节的微型磁盘. (010) 这样机器已经同样地完成了机器的设计和制造. (011) 他(它)们设计了所定购的机器,制造了,甚至检验了它,把废的抛弃了. (012) 但是这一切仍然在人的指导下进行,最重要的是,这一切以人的知识作为基础.

(2) DIVERSAJ FRAZOJ / VARIOUS SENTENCES / 各类文句

(016) KIAM MI ESTIS LUDANTA VIOLONON , MIA ONKLO VIZITIS NIAN HEJMON .
WHEN I WAS PLAYING VIOLIN , MY UNCLE VISITED OUR HOME .
当我(当时)正在拉小提琴时,我的叔叔访问了我的家.

(020)  MI ESTOS FININTA LA EKSPERIMENTON PRI MASHINA TRADUKADO POST KELKAJ MONATOJ .
I WILL HAVE FINISHED THE EXPERIMENT ABOUT MACHINE'S TRANSLATING IN SEVERAL MONTHS.
我在几月以后将已经完成关于机器的翻译的实验.

(028)  BABELO NE ESTIS ELKONSTRUITA.
BABEL HAD NOT BEEN BUILT UP .
巴贝尔塔没有被建成.

(029)  NEPRE ESTOS ELKONSTRUITA LA NOVA BABELO .
ABSOLUTELY WILL HAVE BEEN BUILT UP THE NEW BABEL .
新巴贝尔塔必然地将被建成.

(040)  KIAL VI LERNAS ESPERANTON ?
WHY DO YOU LEARN ESPERANTO ?
为什么你学习世界语?

(044)  NE PROKRASTU LA HODIAUAN LABORON GHIS MORGAU .
DON'T PUT OFF THE TODAY'S WORK TILL TOMORROW .
别把今天的工作推迟到明天.

(045)  KIEL BONE PENTRAS LA KNABO !
HOW WELL THE BOY PAINTS !
男孩多么好地画画啊!

(048)  KIU ESTAS LA AUTORO DE LA LIBRO , KIUN VI JHUS LEGIS ?
WHO IS THE AUTHOR OF THE BOOK , WHICH YOU JUST READ ?
你刚刚读了的书的作者是谁?

(050)  SE MI PARTOPRENUS EN VIA AMUZA AKTIVADO , MI ESTUS TRE GHOJA .
IF I WOULD TAKE PART IN YOUR RECREATIONAL ACTIVITY , I WOULD BE VERY GLAD .
如果我参加你(们)的文娱活动,我会是很高兴的.

(056)  CHU VI MEMORAS LA TAGOJN , KIAM NI KUNE STUDIS EN LA UNIVERSITATO ?
DO YOU REMEMBER THE DAYS , WHEN WE TOGETHER STUDIED IN THE UNIVERSITY ?
你记得我们在一起在大学里面学习的日子吗?

(058)  UNUIGHU PROLETOJ DE CHIUJ LANDOJ !
LET PROLETARIANS OF ALL COUNTRIES UNITE !
让所有国家的无产者联合吧!

(061)  KIEL SAGHA VI ESTAS !
HOW WISE YOU ARE !
你是多么聪明啊!

(062)  ESPERANTO ESTAS INTERNACIA HELPA LINGVO .
ESPERANTO IS INTERNATIONAL HELP LANGUAGE .
世界语是国际辅助语言.

(067)  LIA PROPONO ESTAS , KE NI CHIUJ LIBERE ELMETU NIAJN OPINIOJN .
HIS PROPOSAL IS , THAT WE ALL FREELY OUTPUT OUR OPINIONS .
他的建议是,让我们所有人自由地提出我们的意见.

(068)  MI NE SCIAS , KIAM KOMENCIGHOS NIAJ FERIOJ .
I DON'T KNOW , WHEN WILL BEGIN OUR HOLIDAYS .
我不知道,我们的假日什么时候将开始.

(069)  LA LIBRO , KIU KUSHAS SUR LA TABLO , ESTAS VERDA .
THE BOOK , WHICH LIES ON THE TABLE , IS GREEN .
在桌子上躺的书是绿的.

(071)  LA INFANO PLORAS , CHAR IU LIN BATIS .
THE CHILD CRIES , BECAUSE SOMEBODY BEAT HIM .
小孩哭,因为某人打了他.

(078)  LERNI ESPERANTON NE ESTAS MALFACILE .
TO LEARN ESPERANTO IS NOT DIFFICULT .
学习世界语不是困难的.

(084)  MI NE SCIAS , CHU VI POVAS PLENUMI TIUN CHI TASKON .
I DON'T KNOW , WHETHER YOU CAN FULFIL THIS TASK .
我不知道,是否你能完成这个任务.

(086)  MULTAJ DIVERSLANDAJ ESPERANTISTOJ CHEESTOS LA UNIVERSALAN KONGRESON DE ESPERANTO OKAZONTAN PEKINE .
A LOT OF VARIOUS COUNTRY'S ESPERANTISTS WILL ATTEND THE UNIVERSAL CONGRESS OF ESPERANTO TO BE HELD IN BEIJING .
许多不同国家的世界语者将参加在北京将召开的世界语的国际大会.

(089)  LIA PROPONO ELEKTI NOVAN PREZIDANTON NE ESTIS AKCEPTITA .
HIS PROPOSAL TO ELECT NEW PRESIDENT HAD NOT BEEN ACCEPTED .
他的选举新总统的建议没有被接受.

(090)  SHI ESTAS LA PLEJ BELA EL LA KNABINOJ .
SHE IS THE MOST BEAUTIFUL OF THE GIRLS .
她在女孩里面是最漂亮的.

(092)  FALINTE , LI NE POVIS RELEVIGHI .
HAVING FALLEN , HE COULD NOT GET UP .
摔倒了,他不能重新起来.

(093)  FORIRONTE , LI PREMIS MIAN MANON .
TO GO AWAY , HE SHOOK MY HAND .
将要离去,他握了我的手.

(098)  MI TRE AMAS ESPERANTON , MI PLI AMAS ESPERANTISTOJN , MI PLEJ AMAS LA IDEALON DE ESPERANTO .
I VERY MUCH LOVE ESPERANTO , I MORE LOVE ESPERANTISTS , I MOST LOVE THE IDEAL OF ESPERANTO .
我很爱世界语,我更爱世界语者,我最爱世界语的理想.

(116)  NI LUDU , CHU BONE ?
LET'S PLAY , ALL RIGHT ?
让我们玩吧,好吗?

(119)  KIA MIRAKLO TIO ESTAS , KE NIAJ ANTIKVULOJ KONSTRUIS LA GRANDAN MURON NUR PER SIAJ DU MANOJ !
WHAT MIRACLE IT IS , THAT OUR ANCESTORS BUILT THE GREAT WALL ONLY BY THEIR TWO HANDS !
我们的祖先仅仅用自己的两手建造了长城,这是怎样的奇迹啊!

(121)  FORPASIS UNU TAGO , FORPASIS ANKAU LA DUA .
PASSED AWAY ONE DAY , PASSED AWAY ALSO THE SECOND .
一天过去了,第二也过去了.

(122)  CHU ESTAS EBLE , KE VI NENION SCIAS ?
IS IT POSSIBLE , THAT YOU KNOW NOTHING ?
你不知道任何事,这是可能的吗?

(131)  LA HOMON , PRI KIU VI PAROLAS , MI NENIAM VIDIS .
I NEVER SAW THE MAN , ABOUT WHOM YOU SPEAK .
我从未看见过你提到的人.

(132)  NI , ESPERANTISTOJ , DEVAS LABORI PLI ENERGIE OL IAM .
WE , ESPERANTISTS , MUST WORK MORE HARD THAN EVER .
我们,世界语者,应该比任何时候更努力工作.

(133)  SOMERE ESTAS TRE VARME .
IN SUMMER IT IS VERY HOT .
夏天是很热的.

(134)  DOKTORO ZAMENHOF NASKIGHIS LA 15-AN DE DECEMBRO EN 1859 .
DOCTOR ZAMENHOF WAS BORN ON THE 15TH OF DECEMBER IN 1859 .
柴门霍夫博士1859年十二月的15号出生.

(135)  SE VI SCIUS , KIU LI ESTAS , VI LIN PLI ESTIMUS .
IF YOU WOULD KNOW , WHO HE IS , YOU MORE WOULD ESTEEM HIM .
如果你知道,他是谁,你更会尊敬他.

(136)  CENTOJ DA MALFERMAJ AUTOJ NIN PORTIS AL LA CENTRA LENIN-STADIONO, MALRAPIDE MOVIGHANTE TRA LA HOMA SVARMO .
HUNDREDS OF OPEN CARS CARRIED US TO THE CENTRAL LENIN STADIUM , SLOWLY MOVING THROUGH THE MAN'S SWARM .
成百敞篷汽车把我们带到中央列宁运动场,缓慢地通过人群运动.

(137)  MI VIDIS , KE LI FALIS KAJ LIA VESTO MALPURIGHIS .
I SAW , THAT HE FELL AND HIS CLOTHES BECAME DIRTY .
我看见了,他摔倒了,他的衣服弄脏了.

(139)  MI SCIIS , KE LI NE FAROS , KION LI PROMESIS .
I KNEW , THAT HE WOULD NOT DO WHAT HE PROMISED .
我知道,他将不做他允诺的.

(140)  ESTAS PAULO , KIU ARANGHIS LA AFERON .
IT IS PAULO THAT ARRANGED THE AFFAIR .
是PAULO安排了事情.

(142)  KUREGIS LA KNABO PER SIA TUTA FORTO , SED LI NE POVIS ATINGI LA PAPILION .
RAN THE BOY BY HIS TOTAL STRENGTH , BUT HE COULD NOT ACHIEVE THE BUTTERFLY .
男孩用自己的整个力量狂奔,但是他不能达到蝴蝶.

(144)  LI DONIS AL MI MULTAJN INSTRUAJN LIBROJN .
HE GAVE ME A LOT OF TEACHING BOOKS .
他给了我许多教科书.

(145)  CHU VI PAROLAS CHINE AU JAPANE ?
DO YOU SPEAK IN CHINESE OR IN JAPANESE ?
你用中文还是用日文说话?

(151)  NUR TIU NE ERARAS , KIU NENIAM ION FARAS .
ONLY THAT PERSON IS NOT WRONG , WHO NEVER DOES SOMETHING .
仅仅从不做某事的那个人不犯错误.

(155)  ESPERANTO ESTAS CHIES PROPRAJHO .
ESPERANTO IS EVERYBODY'S PROPERTY .
世界语是所有人的财产.

(156)  MI MEMORAS CHIUN , KIUN MI VIDIS .
I REMEMBER ALL , WHOM I SAW .
我记得我看见了的所有人.

(157)  ESTAS NENIU EN LA CHAMBRO .
THERE IS NOBODY IN THE ROOM .
在房间里面没有任何人.

(3) DU POEMOJ / TWO POEMS / 两首诗歌

(099) LA ESPERO : ESPERANTISTA HIMNO ( POEMO FAR ZAMENHOF ) .

(100) EN LA MONDON VENIS NOVA SENTO ,
TRA LA MONDO IRAS FORTA VOKO ;
(101) PER FLUGILOJ DE FACILA VENTO ,
NUN DE LOKO FLUGU GHI AL LOKO .

(102) NE AL GLAVO SANGONSOIFANTA ,
GHI LA HOMAN TIRAS FAMILION ;
(103) AL LA MOND' ETERNE MILITANTA ,
GHI PROMESAS SANKTAN HARMONION .

(099) THE HOPE : ESPERANTIST'S HYMN ( POEM BY ZAMENHOF ) .

(100) INTO THE WORLD CAME NEW FEELING ,
OVER THE WORLD GOES STRONG VOICE ;
(101) BY WINGS OF EASY WIND ,
NOW FROM PLACE LET IT FLY TO PLACE .

(102) NOT TO SWORD BLOODTHIRSTY ,
IT PULLS THE MAN FAMILY ;
(103) TO THE WORLD EVER FIGHTING ,
IT PROMISES SACRED HARMONY .

(099) 希望: 世界语者的颂歌 (柴门霍夫所作的诗歌).

(100) 新感觉来到了世界,
有力的声音走遍世界;
(101) 用顺风的翅膀,
现在让它从一个地方飞到另一个地方吧.

(102) 它不把人的家庭
引到渴血的刀剑;
(103) 向永远战争着的世界,
它允诺神圣的和谐.

(104) AL NIA KARA LINGVO ( FAR IU NOVA ESPERANTISTO ) .

(105) LA LINGVO GRACIA , KARA MIA ,
GHIS KIAM VI VENIS AL MI FINE FIN ?
(106) ATENDIS SOIFE MI , ETERNE VIA ,
MI AMAS VIN !

(107) MI AMAS VIN VERE , PRUVU DIO ,
KAJ MIA BON-KORO BATAS NUR POR VI ;
(108) NE PLU SEKRETETO ESTAS TIO :
VIN AMAS MI !

(109) CHU KREDAS VI MIAN AMON MARAN ?
(110) CHU KREDAS , KE MIA KORO FLAMAS ?
(111) CHU KREDAS LA VORTON PURE KARAN :
VIN MI AMAS !

(104) TO OUR DEAR LANGUAGE ( BY SOME NEW ESPERANTIST ) .

(105) THE LANGUAGE GRACEFUL , MY DEAR ,
TILL WHEN YOU CAME TO ME AT LAST ?
(106) WAITED LONGINGLY I , EVER YOURS ,
I LOVE YOU !

(107) I LOVE YOU TRUELY , LET GOD PROVE ,
AND MY GOOD HEART BEATS ONLY FOR YOU ;
(108) NO LONGER THAT IS LITTLE SECRET :
I LOVE YOU !

(109) DO YOU BELIEVE MY LOVE LIKE SEA ?
(110) DO BELIEVE , THAT MY HEART BURNS ?
(111) DO BELIEVE THE WORD PURELY DEAR :
I LOVE YOU !

(104) 献给我们的亲爱的语言(某新世界语者所作).

(105) 优美的语言,我的亲爱的,
到什么时候你最后来到了我这儿?
(106) 我渴望地等待,你的永远的,
我爱你!

(107) 我真实地爱你,让上帝证明吧,
我的善良的心仅仅为了你跳动;
(108) 那已经不再是小秘密:
我爱你!

(109) 你相信我的大海一样的爱吗?
(110) 相信,我的心燃烧吗?
(111) 相信纯粹地亲爱的词吗:
我爱你!

 

 

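【Related】

Master's thesis: An experiment in automatic translation from Esperanto into Chinese and English
Liwei's master's thesis: 1. Overview of EChA
Liwei's master's thesis: 2. Esperanto: linguistic characteristics and research value
Liwei's master's thesis: 3. The hierarchical recursive constituent system
Liwei's master's thesis: 4. EChA machine dictionary and word lists
Liwei's master's thesis: 5. Esperanto morphological analysis
Liwei's master's thesis: 6/7 Esperanto syntactic analysis
Liwei's master's thesis: 8. English morphological generation
Liwei's master's thesis: 9. Target-language reordering
Liwei's master's thesis: 10. Analysis of the EChA experiment results
Liwei's master's thesis: 【Acknowledgments】【References】
Liwei's master's thesis, full text (Esperanto version)

《朝华午拾: shijie, anecdotes of a junior schoolmate (3): crazy Esperanto》

Inspiration as if divinely given, craft that more than rivals nature

《Liwei's essays: learn Esperanto grammar in one hour》

Liwei's Esperanto article (1987): 《中国报道: the Tower of Babel will surely be built》

Liwei's Esperanto paper (1986): 《Automatic translation from the International Language into Chinese and English》

Liwei (1988) 《世界科技: An experiment in automatic translation from Esperanto into Chinese and English》

Background of the DLT project

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

【On machine translation】

【Pinned: overview of Liwei's NLP blog posts】

Table of contents of 《朝华午拾》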

Outline of an HPSG-style Chinese reversible grammar

 Outline of an HPSG-style Chinese reversible grammar*

Wei  LI
Simon Fraser University
(NLWC97)

This paper presents the outline and the design philosophy of a lexicalized Chinese unification grammar named W‑CPSG. W‑CPSG covers Chinese morphology, syntax and semantics in a novel integrated language model. The grammar works reversibly and is suited for both parsing and generation. This work is developed in the general spirit of the linguistic theory Head-driven Phrase Structure Grammar (Pollard & Sag 1994). We identify two problems as major obstacles to formulating a precise and efficient Chinese grammar. First, serious study of the Chinese lexical base is lacking, and generalizations are often drawn too soon. Second, there is a lack of effective interaction and of an adequate interface between morphology, syntax and semantics. We address these problems in depth with the lexicalized and integrated design of W‑CPSG. We also illustrate how W‑CPSG is formalized and how it works.

 

  1. Background

Unification grammars have been extensively studied in the last decade (Shieber 1986). Implementations of such grammars for English are being used in a wide variety of applications. Attempts also have been made to write Chinese unification grammars (Huang 1986, among others). W‑CPSG (for Wei's Chinese Phrase Structure Grammar, Li, W. 1997b) is a new endeavor in this direction, with its unique design and characteristics.

1.1. Design philosophy

We identify two problems as major obstacles to formulating a precise and efficient Chinese grammar. First, serious study of the Chinese lexical base is lacking, and generalizations are often drawn too soon. Second, there is a lack of effective interaction and of an adequate interface between morphology, syntax and semantics. We address these problems in depth with the lexicalized and integrated design of W‑CPSG.

1.1.1. Lexicalized design

It has been widely accepted that a well-designed lexicon is crucial for a successful grammar, especially for a natural language computational system. But Chinese linguistics in general, and Chinese computational grammars in particular, have lacked in-depth research on the Chinese lexical base. For many years, most dictionaries published in China did not even give grammatical categories in their lexical entries (except for a few dictionaries intended for foreign learners of Chinese). English dictionaries such as the Oxford Advanced Learner's Dictionary and the Longman Dictionary of Contemporary English embody a sophisticated design and rich linguistic information; Chinese linguistics is hampered by the lack of comparably reliable lexical resources.

In the last decade, however, Chinese linguists have achieved significant progress in this field. The publication of 800 Words in Contemporary Mandarin (Lü et al., 1980) marked a milestone for Chinese lexical research. This book is full of detailed linguistic description of the most frequently used Chinese words and their collocations. Since then, Chinese linguists have made fruitful efforts, marked by the publication of a series of valency dictionaries (e.g. Meng et al., 1987) and books  (e.g. Li, L. 1986, 1990). But almost all such work was done by linguists with little knowledge of computational linguistics. Their description lacks formalization and consistency. Therefore, Chinese computational linguists require patience in adapting and formalizing these results, making them implementable.

1.1.2. Integrated design

Most conventional grammars assume a successive model of morphology, syntax and semantics. We argue that this design is not adequate for Chinese natural language processing. Instead, an integrated grammar of morphology, syntax and semantics is adopted in W‑CPSG.

Let us first discuss the rationale for integrating morphology and syntax in Chinese grammar. As it stands, a written Chinese sentence is a string of characters (morphemes) with no blanks to mark word boundaries. Conventional systems use a procedure-based Chinese morphology preprocessor (the so-called segmenter), whose major purpose is to identify a string of words to feed to syntax. This is not an easy task, because segmentation can be ambiguous. For example, the string of four Chinese characters da xue sheng huo has the two segmentations shown in (1a) and (1b) below.

(1)        da xue sheng huo

(a)        da-xue                  | sheng-huo
           university              | life

(b)        da-xue-sheng            | huo
           university-student      | live

The resolution of the above ambiguity in the morphology preprocessor is a hopeless job because such structural ambiguity is syntactically conditioned. For sentences like da xue sheng huo you qu (university life is interesting), (1a) is the right identification. For sentences like da xue sheng huo bu xia qu le (university students cannot make a living), (1b) is right. So far there are no segmenters which can handle this properly and guarantee correct word segmentation (Feng 1996). In fact, there can never be such segmenters as long as syntax is not brought in. This is a theoretical defect of all Chinese analysis systems in the morphology-before-syntax architecture (Li, W. 1997a). I have solved this problem in our morphology-syntax integrated W‑CPSG (see 2.2. below).
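To make the ambiguity in (1) concrete, here is a small sketch that enumerates every way a toy lexicon can cover the string. It only illustrates why a lexicon-driven segmenter alone cannot choose between (1a) and (1b); the four-entry lexicon and the function name are assumptions for this example and do not reflect how W‑CPSG itself works.

```python
# Toy lexicon keyed by syllable tuples (an assumption for illustration).
LEXICON = {
    ("da", "xue"): "university",
    ("sheng", "huo"): "life",
    ("da", "xue", "sheng"): "university-student",
    ("huo",): "live",
}


def segmentations(syllables, lexicon):
    """Return every segmentation of the syllable list into lexicon entries."""
    if not syllables:
        return [[]]
    results = []
    for end in range(1, len(syllables) + 1):
        head = tuple(syllables[:end])
        if head in lexicon:
            for rest in segmentations(syllables[end:], lexicon):
                results.append(["-".join(head)] + rest)
    return results


for seg in segmentations(["da", "xue", "sheng", "huo"], LEXICON):
    print(seg)
```

Both segmentations come out as equally valid word strings, so only the following context (you qu versus bu xia qu le) can decide between them.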

Now we examine the motivation for integrating syntax and semantics in Chinese grammar. It has been observed that, compared with the analysis of Indo-European languages, proper Chinese analysis relies more heavily on semantic information (see, e.g. Chen 1996, Feng 1996). Chinese syntax is not as rigid as that of languages with inflections. Semantic constraints are called for in both structural and lexical disambiguation as well as in controlling computational complexity. The integration of syntax and semantics helps establish flexible ways for them to interact in analysis (see 2.3. below).

1.2. Major theoretical foundation: HPSG

The work on W‑CPSG is developed in the spirit of the linguistic theory Head-driven Phrase Structure Grammar (HPSG, proposed by Pollard & Sag, 1987). HPSG is a highly lexicalist theory, which encourages the integration of different components. This matches our design philosophy for implementing our Chinese computational grammar. HPSG serves as a desired framework to start this research with. We benefit most from the general linguistic ideas in HPSG. However, W‑CPSG is not confined to the theory-internal formulations of principles and rules and other details in HPSG versions (e.g. Pollard & Sag 1987, 1994 or later developments). We borrow freely from other theoretical sources or form our own theories in W‑CPSG to meet our goal of Natural Language Processing in general and Chinese computing in particular. For example, treating morphology as an integrated part of parsing and placing it right into grammar is our deliberate choice. In syntax, we formulate our own theory for configuration and word order. Our semantics differs most from any standard version of situation-semantics-based theory in HPSG. It is based on insights from Tesnière's Dependency Grammar (Tesnière 1959), Fillmore's Case Grammar (Fillmore 1968) and  Wilks' Preference Semantics (Wilks 1975, 1978) as well as our own semantic view for knowledge representation and better coordination of syntax-semantics interaction (Li, W. 1996). For these differences and other modifications, it is more accurate to regard W‑CPSG as an HPSG-style Chinese grammar, rather than an (adapted) version of Chinese HPSG.

  2. Integrated language model

2.1. W‑CPSG versus conventional Chinese grammar

The lexicalized design sets the common basis for the organization of the grammar in W‑CPSG. This involves the interfaces of morphology, syntax and semantics.[1]   W‑CPSG assumes an integrated language model of its components (see Figure 1).  The W‑CPSG model is in sharp contrast to the conventional clear-cut successive design of grammar components (see Figure 2).

 

 lw1

Figure 2.  conventional language model (non-reversible)

2.2. Interfacing morphology and syntax

As shown in Figure 2 above, conventional  systems take a two-step approach: a procedure-based preprocessor for word identification (without discovering the internal structure) and a grammar for word-based parsing. W‑CPSG takes an alternative one-step approach and the parsing is character- (i.e. morpheme-) based. A morphological PS (phrase structure) rule is designed not only to identify candidate words but to build word‑internal structures as well. In other words, W‑CPSG is a self-contained model, directly accepting the input of a character string for parsing. The parse tree embodies both the morphological analysis and the syntactic analysis, as illustrated by the following sample parsing chart.

lw6

Note:    DET for determiner; CLA for classifier; N for noun; DE for particle de;
AF for affix; V for verb; A for adjective; CLAP for classifier phrase;
NP for noun phrase; DEP for DE-phrase

This is so-called bottom-up parsing. It starts with lexicon look-up. Simple edges 1 through 7 are lexical edges. Combined edges are phrasal edges. Each edge represents a sign, i.e. a character (morpheme), a word, a phrase or a sentence. Lexical edges result from a successful match between the signs in the input string and the entries in the lexicon during lexicon look-up. After lexicon look-up, the lexical information for the signs is made available to the parser. For the sake of concise illustration, we show only two crucial pieces of information for each edge in the chart, namely category and interpretation, separated by a colon (some function words are labeled for category only). The parser attempts to combine the edges according to the PS rules in the grammar until a parse is found. A parse is an edge which ranges over the whole string. The parse ((((1+2)+3)+4)+((5+6)+7)) represents the following binary structural tree embodying both the morphological and syntactic analysis of this NP.

lw5

As seen, word identification is no longer a pre-condition for parsing. It becomes a natural by-product of parsing in this integrated grammar of morphology and syntax: a successful parse always embodies the right word identification. For example, the parse ((((1+2)+3)+4)+((5+6)+7)) includes the identification of the word string zhe (DET) ben (CLA) shu (N) de (DE) ke-du-xing (N). An argument against the conventional separation model is that the two-step approach has a theoretical ceiling on the precision of word identification. This is because proper word identification in Chinese is to a considerable extent syntactically conditioned, owing to the structural ambiguity that may be involved. Our strategy has advantages over the conventional approach in resolving word identification ambiguities and in handling productive word formation. It has solved the problems inherent in the morphology-before-syntax architecture (for detailed argumentation, see Li, W. 1997a).
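The character-based, bottom-up combination of edges can be imitated with a tiny chart parser. This is a deliberately simplified sketch: it uses atomic category labels and a handful of assumed binary rules instead of the unification-based PS rules of W‑CPSG, and the lexicon assigns one category per morpheme. Its only purpose is to show how morphological rules (AF + V, A + AF) and syntactic rules can share one chart.

```python
# Assumed lexicon for the seven morphemes; category labels follow the chart note.
TOKENS = [("zhe", "DET"), ("ben", "CLA"), ("shu", "N"), ("de", "DE"),
          ("ke", "AF"), ("du", "V"), ("xing", "AF")]

# Assumed binary PS rules: (left category, right category) -> parent category.
RULES = {
    ("DET", "CLA"): "CLAP",
    ("CLAP", "N"): "NP",
    ("NP", "DE"): "DEP",
    ("AF", "V"): "A",    # morphological: prefix ke + verb -> adjective
    ("A", "AF"): "N",    # morphological: adjective + suffix xing -> noun
    ("DEP", "N"): "NP",
}


def parse(tokens, rules):
    """Fill the chart bottom-up; return (category, bracketing) edges over the whole string."""
    n = len(tokens)
    # Lexical edges: one per morpheme, numbered from 1 as in the text.
    chart = {(i, i + 1): {(cat, str(i + 1))} for i, (_, cat) in enumerate(tokens)}
    for width in range(2, n + 1):
        for start in range(n - width + 1):
            end = start + width
            cell = set()
            for mid in range(start + 1, end):
                for lcat, lbr in chart[(start, mid)]:
                    for rcat, rbr in chart[(mid, end)]:
                        parent = rules.get((lcat, rcat))
                        if parent:
                            cell.add((parent, f"({lbr}+{rbr})"))
            chart[(start, end)] = cell
    return chart[(0, n)]


print(parse(TOKENS, RULES))
```

The word ke-du-xing is never looked up as a unit: it emerges as the edge ((5+6)+7) of category N, which is the sense in which word identification is a by-product of parsing.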

2.3. Interaction of syntax and semantics

The interface and interaction of syntax and semantics are of vital importance in a Chinese grammar. We share the opinion of Chen (1996) and many others that it is more effective to analyze Chinese in an environment where semantic constraints are enforced during parsing, not after. The argument is based on the linguistic characteristics of Chinese. Chinese has no inflection (like English ‑'s, ‑s, ‑ing, ‑ed, etc.) and no such formatives as articles (like English a, the), an infinitivizer (like English to) or a complementizer (like English that). Instead, function words and word order are used as the major syntactic devices. But Chinese function words (prepositions, aspect particles, passive particle, plural suffix, conjunctions, etc.) can often be omitted (Lü et al. 1980, p.2). Moreover, the fixed word order that is usually assumed to mark syntactic functions in isolating languages does not, to a considerable extent, hold for Chinese. In fact, there is remarkable freedom or flexibility in Chinese word order. One typical example is the numerous word order variations of the Chinese transitive patterns (Li, W. 1996), although the default order is S‑V‑O (subject-verb-object). Taken together, these facts project a picture of Chinese as a language of loose syntactic constraint. A weak syntax requires some support beyond syntax to enhance grammaticality. Semantic constraints are therefore called for. I believe that an effective way to model this interaction between syntax and semantics is to integrate the two in one grammar.

One strong piece of evidence for this syntax-semantics integration argument is that Chinese has what I call syntactically crippled structures. These are structures which can hardly be understood on purely formal grounds and are usually judged ungrammatical unless supported by semantic constraints (i.e. the match of semantic selection restrictions). Some Chinese NP predicates (Li, W. & McFetridge 1995) and transitive patterns like S‑O‑V (Li, W. 1996), among others, are such structures. The NP predicate is a typical instance of semantic dependence. It is highly undesirable to assume a general rule like S --> NP1 NP2 in a Chinese grammar to capture such phenomena, because there is a semantic condition for NP2 to function as predicate, which makes the Chinese NP predicate a very restricted pattern. For example, in the sentence This table is three-legged: zhe (this) zhang (classifier) zhuo-zi (table) san (three) tiao (classifier) tui (leg), the subject must be of the semantic type animate or furniture (which can have legs). The general rule with no recourse to semantic constraints is simply too productive and may cause severe computational complexity. In the case of Chinese transitive patterns, formal means are decisive for the interpretation (i.e. role assignment) of some variations, but others are heavily dependent on semantic constraints. Take chi (eat) as an example. There is no difference in syntactic form in sentences like wo (I) chi (eat) dianxin (Dim-Sum) le (perfect-aspect) and dianxin (Dim-Sum) wo (I) chi (eat) le (perfect-aspect). Who eats what? To properly assign roles to NP1 NP2 V as S-O-V versus O-S-V, the semantic constraint animate eats food needs to be enforced.
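The role-assignment problem for an NP1 NP2 V string can be sketched as a lookup of selection restrictions. The semantic types and the restriction table below are invented for illustration, and the function name is hypothetical; W‑CPSG encodes such constraints in its lexicon rather than in a separate table like this.

```python
# Assumed semantic types of the nouns and selection restrictions of the verb.
SEMANTIC_TYPE = {"wo": "animate", "dianxin": "food"}
SELECTION = {"chi": {"subject": "animate", "object": "food"}}


def assign_roles(np1, np2, verb):
    """Decide between S-O-V and O-S-V for the string NP1 NP2 V, or return None."""
    wanted = SELECTION[verb]

    def fits(np, role):
        return SEMANTIC_TYPE.get(np) == wanted[role]

    if fits(np1, "subject") and fits(np2, "object"):
        return {"pattern": "S-O-V", "subject": np1, "object": np2}
    if fits(np2, "subject") and fits(np1, "object"):
        return {"pattern": "O-S-V", "subject": np2, "object": np1}
    return None  # the semantic constraint rejects the string


print(assign_roles("dianxin", "wo", "chi"))
print(assign_roles("wo", "dianxin", "chi"))
```

The two inputs have the same syntactic shape; only the constraint "animate eats food" tells them apart, and a string that satisfies neither reading is rejected instead of being parsed anyway.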

The conventional syntax-before-semantics model has lost popularity in the Chinese computing community. Researchers have been exploring various ways of integrating syntax and semantics in Chinese grammar (Chen 1996). In W‑CPSG, Chinese syntax is enhanced by the incorporation of a semantic constraint mechanism. This mechanism embodies a lexicalized knowledge representation, which parallels the syntactic representation in the lexicon. I have developed a way to dynamically coordinate the syntactic and semantic constraints in one model. This technique proves to be effective in handling rhetorical expressions and in making the grammar both precise and robust (Li, W. 1996).

 

  3. Lexicalized formal grammar

3.1. Formalized grammar

The applied nature of this research requires that we pay equal attention to the practical issues of computational systems and to a sound theoretical design. All theories and rule formulations in W‑CPSG are implementable. In fact, most of them have been implemented in our prototype W‑CPSG. W‑CPSG is a strictly formalized grammar that does not rely on undefined notions. The whole grammar is represented by typed feature structures (TFS), as defined below based on Carpenter & Penn (1994).

(3)        Definition: typed feature structure 

A typed feature structure is a data structure adopted to model a certain object of a grammar. The necessary part for a typed feature structure is type. Type represents the classification of the feature structure. A simple feature structure contains only the type information, but a complex feature structure can introduce a set of feature-value pairs in addition to the type. A feature-value pair consists of a feature and a value. A feature reflects one aspect of an object. The value describes that aspect. A value is itself a feature structure (simple or complex). A feature determines which type of feature structures it takes as its value. Typed feature structures are finite in a grammar. Their definition constitutes the typology of the grammar.
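
To make the definition concrete, here is a minimal Python sketch of a typed feature structure. It is an illustration only, not ALE's data model: the class TFS, the table APPROPRIATE and the method names are hypothetical, the type fragment is far smaller than the real typology, and type inheritance is left out.

```python
from dataclasses import dataclass, field

# Hypothetical fragment of a typology: for each type, the features it
# introduces and the type that each feature takes as its value.
APPROPRIATE = {
    "a_sign": {"CATEGORY": "category", "CONTENT": "content"},
    "category": {},
    "content": {},
}


@dataclass
class TFS:
    type: str                                     # the obligatory part
    features: dict = field(default_factory=dict)  # feature name -> TFS value

    def is_simple(self):
        """A simple feature structure carries only type information."""
        return not self.features

    def well_typed(self):
        """Each feature must be licensed by the type and take a value of the required type."""
        allowed = APPROPRIATE.get(self.type)
        if allowed is None:
            return False
        for feat, value in self.features.items():
            if allowed.get(feat) != value.type or not value.well_typed():
                return False
        return True


sign = TFS("a_sign", {"CATEGORY": TFS("category")})
print(sign.well_typed(), TFS("category").is_simple())
```

Because a value is itself a TFS, the check recurses naturally, which mirrors the statement that a value is itself a feature structure (simple or complex).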

With this formal device of typed feature structures, we formulate W‑CPSG by defining from the very basic notions (e.g. sign, morpheme, word, phrase, S, NP, VP, etc.) to rules (PS rules and lexical rules), lexical items, lexical hierarchy and typology (hierarchy embodied in feature structures) (Li, W. 1997b). The following sample definitions of some basic notions illustrate the formal nature of W‑CPSG. Please note that they are system-internal definitions and are used in W‑CPSG to serve the purpose of configurational constraints (see Chapter VI of Li, W. 1997b).

(4)        Definition: sign [2]

a_sign
KANJI kanji
MORPH expected
CATEGORY category
COMP0 expected
COMP1 expected
COMP2 expected
MOD expected
KNOWLEDGE knowledge
CONTENT content
DTR dtr

A sign is the most fundamental concept of grammar. A sign is a dynamic unit of grammatical analysis. It can be a morpheme, a word, a phrase or a sentence. Formally, a sign is defined by the TFS a_sign, which introduces a set of linguistic features for its description, as shown above. These features include the orthographic feature KANJI; morphological feature MORPH; syntactic features CATEGORY, COMP0, COMP1, COMP2, and MOD; structural feature (for both morphology and syntax) DTR; semantic features KNOWLEDGE and CONTENT.

(5)        Definition: morpheme

a_sign
MORPH ~saturated

A morpheme is a sign whose morphological expectation has not been saturated. In W‑CPSG, ~saturated is equivalent to obligatory/optional/null. For example, the suffix ‑xing (‑ness) is such a morpheme whose morphological expectation for a preceding adjective is obligatory.  In W‑CPSG, a morpheme like ‑xing (‑ness) ceases to be a morpheme when its obligatory expectation, say the adjective ke-du (readable), is saturated. Therefore, the sign ke-du-xing (readability) is not a morpheme, but becomes a word per se.

(6)        Definition: word

a_sign
MORPH ~obligatory
DTR no_syn_dtr

In W‑CPSG, ~obligatory is equivalent to saturated/optional/null. The specification [MORPH ~obligatory] defines a syntactic sign, i.e. a sign whose obligatory morphological expectation has been saturated. A word is a syntactic sign with no syntactic daughters, i.e. [DTR no_syn_dtr]. Obviously, word with [MORPH saturated/optional/null] overlaps morpheme with [MORPH obligatory/optional/null] in cases when the morphological expectation is optional or null.

Just like the overlapping of morpheme and word, there is also an intersection between word and phrase. Compare the following definition of phrase with the above definition of word.

(7)        Definition: phrase

a_sign
MORPH ~obligatory
COMP0 ~obligatory
COMP1 ~obligatory
COMP2 ~obligatory 

A phrase is a syntactic sign whose obligatory complement expectations have all been saturated, i.e. [COMP0 ~obligatory, COMP1 ~obligatory, COMP2 ~obligatory]. When a word has only optional complement expectations or none at all, it is also a phrase. The overlapping relationship among morpheme, word and phrase can be shown by the following illustration of the three sets.

[Figure: the overlapping sets of morpheme, word and phrase]

S is a syntactic sign satisfying the following three conditions: (1) its category is pred (which includes V and A); (2) its comp0 is saturated; (3) its obligatory comp1 and comp2 are saturated.
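
The definitions of morpheme, word, phrase and S can be read as simple predicates over a sign. The Python sketch below is one such reading under stated assumptions: a sign is modelled as a plain dictionary, the expectation values are the four named in the text (obligatory, optional, null, saturated), and the value "syn_dtr" as well as the category "N" are hypothetical stand-ins not given in the definitions.

```python
COMPS = ("COMP0", "COMP1", "COMP2")


def is_morpheme(s):
    return s["MORPH"] != "saturated"       # [MORPH ~saturated]


def is_syntactic_sign(s):
    return s["MORPH"] != "obligatory"      # [MORPH ~obligatory]


def is_word(s):
    return is_syntactic_sign(s) and s["DTR"] == "no_syn_dtr"


def is_phrase(s):
    return is_syntactic_sign(s) and all(s[c] != "obligatory" for c in COMPS)


def is_S(s):
    return (is_syntactic_sign(s)
            and s["CATEGORY"] in ("V", "A")    # pred includes V and A
            and s["COMP0"] == "saturated"
            and s["COMP1"] != "obligatory"
            and s["COMP2"] != "obligatory")


# The suffix -xing still expects a preceding adjective: a morpheme, not a word.
xing = {"MORPH": "obligatory", "DTR": "no_syn_dtr", "CATEGORY": "N",
        "COMP0": "null", "COMP1": "null", "COMP2": "null"}
# ke-du-xing (readability): the expectation is saturated, so it is a word.
keduxing = dict(xing, MORPH="saturated")

print(is_morpheme(xing), is_word(xing))
print(is_morpheme(keduxing), is_word(keduxing), is_phrase(keduxing))
```

Since ke-du-xing has no complement expectation, it satisfies both is_word and is_phrase, which is the word/phrase overlap described above.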

3.2. Lexicalized grammar

W‑CPSG takes a radical lexicalist approach. We started with individual words in the lexicon and have gradually built up a lexical hierarchy and the grammar prototype.

W‑CPSG consists of two parts: a minimized general grammar and an information-enriched lexicon. The general grammar contains only 11 PS rules, covering complement structure, modifier structure, conjunctive structure and morphological structure. We formulate a PS rule for illustration.

[Figure: the comp0 PS rule]

This comp0 PS rule is similar to the rule S ==> NP VP in the conventional phrase structure grammar. The feature COMP0 represents the expectation of the head daughter for its external complement (subject or specifier) on its left side, i.e. [DIRECTION left]. The nature of its expected comp0, NP or other types of sign, is lexically decided by the individual head (hence head-driven or lexicon-driven). It will always be warranted by the general grammar, here via the index [3]. This is the nature of lexicalized grammars. PS rules in such grammars are very abstract. Essentially, they say one thing, namely, two signs can combine so long as the lexicon so indicates. The indices [1] and [2] represent a configurational constraint. They ensure that the internal obligatory complements COMP1 and COMP2 must be saturated before this rule can be applied. Finally, the Head Feature Principle (defined elsewhere in the grammar as an adaptation of the Head Feature Principle in HPSG, Pollard & Sag, 1994) ensures that head features are percolated up from the head daughter to the mother sign.
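
The behaviour described for the comp0 rule can be approximated in a few lines of Python. This is a loose sketch, not the actual rule: real unification over typed feature structures is replaced by a category comparison, signs are plain dictionaries, and the function name comp0_rule and the "status"/"category" keys are hypothetical.

```python
def comp0_rule(left, head):
    """Combine a left sister with a head daughter if the head's lexical entry licenses it.

    Returns the mother sign, or None when the rule does not apply.
    """
    expect = head["COMP0"]
    if expect["status"] not in ("obligatory", "optional"):
        return None                      # no comp0 expectation left to satisfy
    if head["COMP1"] == "obligatory" or head["COMP2"] == "obligatory":
        return None                      # internal complements must be saturated first
    if left["CATEGORY"] != expect["category"]:
        return None                      # the lexicon decides what the head expects
    mother = dict(head)                  # head features (e.g. CATEGORY) percolate up
    mother["COMP0"] = {"status": "saturated", "category": expect["category"]}
    mother["DTR"] = (left, head)         # record the daughters
    return mother


np = {"CATEGORY": "NP"}
vp = {"CATEGORY": "V",
      "COMP0": {"status": "obligatory", "category": "NP"},
      "COMP1": "saturated", "COMP2": "null"}

s = comp0_rule(np, vp)
print(s["CATEGORY"], s["COMP0"]["status"])
```

The rule itself never mentions NP or VP; everything specific comes from the head's entry, which is the sense in which the PS rules of a lexicalized grammar are abstract.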

The lexicon houses lexical entries with their linguistic description and knowledge representation. Potential morphological structures, as well as potential syntactic structures, are lexically encoded (in the feature MORPH for the former and in the features COMP0, COMP1, COMP2, MOD for the latter). Our knowledge representation is also embodied in the lexicon (in the feature KNOWLEDGE). I believe that this is an effective and realistic way of handling natural language phenomena and their disambiguation without having to resort to an encyclopedia-like knowledge base. The following sample formulation of the lexical entry chi (eat) projects a rough picture of what the W‑CPSG lexicon looks like.

[Figure: the lexical entry for chi (eat)]

The lexicon also contains lexical generalizations. The generalizations are captured by the inheritance of the lexical hierarchy and by a set of lexical rules. Due to space limitations, I will not show them in this paper.

  4. Implementation and application of W‑CPSG

A substantial Chinese computational grammar has been implemented in the W‑CPSG prototype. It covers all basic Chinese constructions. Particular attention is paid to the handling of function words and verb patterns. On the basis of the information-enriched lexicon and the general grammar, the system adequately handles the relationship between linguistic individuality and generality. The grammar formalism which I use to code W‑CPSG is ALE, a grammar compiler on top of Prolog, developed by Carpenter & Penn (1994). ALE is equipped with an inheritance mechanism on typed feature structures, a powerful tool in grammar modeling. I have made extensive use of the mechanism in the description of lexical categories as well as in knowledge representation. This seems to be an adequate way of capturing the inherent relationship between features in a grammar. Prolog is a programming environment particularly suitable for the development of unification and reversible grammars (Huang 1986, 1987). ALE compiles W‑CPSG into a Chinese parser, a Prolog program ready to accept a string of characters for analysis. In the first experiment, W‑CPSG parsed a corpus of 200 Chinese sentences of various types.

An important benefit of a unification-based grammar is that the same grammar can be used both for parsing and generation. Grammar reversibility is a highly desirable feature for multilingual machine translation applications. Following this line, I have successfully applied W‑CPSG to an experiment in bi-directional machine translation between English and Chinese. The machine translation system developed in our Natural Language Lab is based on the shake-and-bake design (Whitelock 1992, 1994). I used the same three grammar modules (W‑CPSG, an English grammar and a bilingual transfer lexicon) and the same corpus for the experiment. As part of the machine translation output, W‑CPSG has successfully generated the 200 Chinese sentences. The experimental results meet our design objective and verify the feasibility of our approach.

 

References

 

Carpenter, B. & Penn, G. (1994): ALE, The Attribute Logic Engine, User's Guide

Chen, K-J.  (1996): "Chinese sentence parsing" Tutorial Notes for International Conference on Chinese Computing ICCC'96, Singapore

Feng, Z-W.  (1996): "COLIPS lecture series - Chinese natural language processing",  Communications of COLIPS, Vol. 6, No. 1 1996, Singapore

Fillmore, C. J. (1968): "The case for case". Bach and Harms (eds.), Universals in Linguistic Theory. Holt, Rinehart and Winston, pp. 1-88.

Huang, X-M. (1986): "A bidirectional grammar for parsing and generating Chinese".  Proceedings of the International Conference on Chinese Computing, Singapore, pp. 46-54

Huang, X-M. (1987): XTRA: The Design and Implementation of A Fully Automatic Machine Translation System, Doctoral dissertation, University of Essex.

Li, L-D. (1986): Xiandai Hanyu Juxing (Sentence Patterns in Contemporary Mandarin), Shangwu Yinshuguan, Beijing

Li, L-D. (1990): Xiandai Hanyu Dongci (Verbs in Contemporary Mandarin), Zhongguo Shehui Kexue Chubanshe, Beijing

Li, W. & P. McFetridge (1995): "Handling Chinese NP predicate in HPSG", Proceedings of PACLING-II, Brisbane, Australia

Li, W. (1996): "Interaction of syntax and semantics in parsing Chinese transitive patterns", Proceedings of International Conference on Chinese Computing (ICCC'96), Singapore

Li, W. (1997a): "Chart parsing Chinese character strings", Proceedings of The Ninth North American Conference on Chinese Linguistics (NACCL-9, to be available), Victoria, Canada

Li, W. (1997b): W‑CPSG: A Lexicalized Chinese Unification Grammar, Doctoral dissertation, Simon Fraser University (on-going)

Lü, S-X. et al. (ed.) (1980): Xiandai Hanyu Babai Ci (800 Words in Contemporary Mandarin), Shangwu Yinshuguan, Beijing

Meng, Z., H-D. Zheng, Q-H. Meng, & W-L. Cai (1987): Dongci Yongfa Cidian (Dictionary of Verb Usages), Shanghai Cishu Chubanshe, Shanghai

Pollard, C.  & I. Sag (1987): Information based Syntax and Semantics Vol. 1: Fundamentals. Centre for the Study of Language  and Information, Stanford University, CA

Pollard, C.  & I. Sag (1994): Head-Driven Phrase Structure Grammar,  Centre for the Study of Language and Information, Stanford University, CA

Shieber, S. (1986): An Introduction to Unification-Based Approaches to Grammar. Centre for the Study of Language  and Information, Stanford University, CA

Tesnière, L. (1959): Éléments de Syntaxe Structurale, Paris: Klincksieck

Whitelock, Pete (1992): "Shake and bake translation", Proceedings of the 14th International Conference on Computational Linguistics, pp. 784-790, Nantes, France.

Whitelock, Pete (1994): "Shake and bake translation", C.J. Rupp, M.A. Rosner, and R.L. Johnson (eds.), Constraints, Language and Computation, pp. 339-359, London, Academic Press.

Wilks, Y.A. (1975): "A preferential pattern-seeking semantics for natural language inference". Artificial Intelligence, Vol. 6, pp. 53-74.

Wilks, Y.A. (1978): "Making preferences more active". Artificial Intelligence, Vol. 11, pp. 197-223

 

-------------------------------------

* This project was supported by the Science Council of British Columbia, Canada under G.R.E.A.T. Award (code: 61) and by my industry partner TCC Communications Corporation, British Columbia, Canada. I thank my academic advisors Paul McFetridge and Fred Popowich and my industry advisor John Grayson for their supervision and encouragement. Thanks also go to my colleagues Davide Turcato, James Devlan Nicholson and Olivier Laurens for their help during the implementation of this grammar in our Natural Language Lab. I am also grateful to the editors of the NWLC'97 Proceedings for their comments and corrections.

[1] We leave aside the other components such as discourse, pragmatics, etc. They are an important part of a grammar for a full analysis of language phenomena, but they are beyond what can be addressed in this research.

[2] In formulating W‑CPSG, we use uppercase for feature and lowercase for type; ~ for logical not and / for logical or; number in square brackets for unification.

 

[Related]

Outline of An HPSG-style Chinese Reversible Grammar ABSTRACT

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP

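Liwei's Master's Thesis: Target Language Reordering (9)

An Experiment in Automatic Translation from Esperanto into Chinese and English
-- An Overview of the EChA Machine Translation System

Target Language Reordering

In the earlier function-word pass and morphological-generation pass, some local reordering has already been done, and the reordered words have been given a shared index (the same number). For example: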

CHIO (一切 'all') CHI (这 'this') ----> 这一切 'all this' (012);
DOKTORO (博士 'Doctor') ZAMENHOF (柴门霍夫 'Zamenhof') ----> 柴门霍夫博士 'Dr. Zamenhof' (134)

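The reordering required for English interrogative and negative sentences is carried out together with morphological generation. For example: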

NE (NOT) ESTIS (WERE) ----> WERE NOT (008)

CHU VIA (YOUR) AMIKO (FRIEND) ESTAS (IS) KURACISTO (DOCTOR) ?
----> IS YOUR FRIEND DOCTOR ? (039)

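Starting from the second synthesis pass, the system looks at the sentence as a whole and performs bottom-up reduction and reordering for each target language separately. With the CDC and the reordering subroutine in place, building the reduction-based generation algorithm for a target language is straightforward. The basic idea is:

(1) Take words one by one from the beginning of the sentence to the end, skipping words already processed and non-terminal nodes.
(2) If the word's level number is one and its right link is zero, reduction has reached the top-level main pivot and the sentence is finished.
(3) If the word needs reordering, call the reordering subroutine.
(4) Mark the word as processed and decide, depending on the case, whether to give it the same index as its pivot word.
(5) Call a subroutine to check whether all the sister words of this word have also been processed.
(6) If so, give this word and all its sisters the same index as the pivot word, and mark the pivot word as a terminal node.
(7) Return to step (1).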
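
The control flow of steps (1) to (7) can be sketched in Python as follows. This is a simplified reading, not the EChA code: parent pointers stand in for the level numbers and right links (an assumption, since their exact encoding is not given here), and both the reordering subroutine of step (3) and the shared-index assignment are left out. The function name reduce_bottom_up and the example dependency structure are hypothetical.

```python
def reduce_bottom_up(words, parent):
    """Return the order in which words are reduced.

    words  -- list of tokens in sentence order
    parent -- parent[i] is the index of word i's pivot word, None for the top pivot
    """
    n = len(words)
    children = {i: [j for j in range(n) if parent[j] == i] for i in range(n)}
    processed = [False] * n
    terminal = [not children[i] for i in range(n)]  # words without dependents start as terminal
    order = []
    while True:
        progressed = False
        for i in range(n):                      # step (1): scan from the sentence start
            if processed[i] or not terminal[i]:
                continue                        # skip processed and non-terminal nodes
            # step (3): the reordering subroutine would be called here (omitted)
            processed[i] = True                 # step (4)
            order.append(words[i])
            progressed = True
            if parent[i] is None:               # step (2): the top pivot has been reduced
                return order
            sisters = children[parent[i]]       # step (5)
            if all(processed[j] for j in sisters):
                terminal[parent[i]] = True      # step (6): the pivot becomes terminal
            break                               # step (7): restart the scan
        if not progressed:
            raise ValueError("no reducible word found; the structure is not a tree")


# Assumed analysis: VIA depends on AMIKO; AMIKO and KURACISTO depend on ESTAS.
print(reduce_bottom_up(["VIA", "AMIKO", "ESTAS", "KURACISTO"], [1, 2, None, 2]))
# -> ['VIA', 'AMIKO', 'KURACISTO', 'ESTAS']
```

Restarting the scan after every processed word keeps the sketch close to step (7); a production version would more likely keep a work list instead of rescanning.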

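For English the problem is particularly simple. Only one case requires reordering: a fronted object or a postposed subject of a transitive predicate. (A postposed subject in a sentence with an intransitive predicate needs no reordering.) Chinese is far more complicated. The main rules are:

(1) The subject of existential "有" (ESTI) should be postposed. Apart from that, all postposed subjects (including most subject clauses) are moved forward.

(2) In sentences whose predicate is a Chinese transitive verb that requires "把", "使", etc., the object, once "把", "使", etc. has been added to it, should be placed before the predicate. Apart from that, all fronted objects are moved back.

(3) A postposed attributive clause does not need to be moved forward in two cases: 1. the emphatic pattern ESTAS + X, KIU; 2. attributive clauses of 15 words or more. All other postposed attributives are moved forward. The relative position of sister attributives is determined mainly by their semantic features and is implemented by giving or withholding the shared index during reordering.

(4) Adverbial clauses generally stay in place (although a postposed temporal adverbial clause is better moved forward). All other postposed adverbials are moved forward. The relative position of sister adverbials is handled on the same principle as above.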
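
Rule (3) reduces to a small decision function, sketched below in Python. The function name and parameters are hypothetical, and reading the length threshold as "15 words or more" is an assumption about the boundary case.

```python
def should_front_attributive(is_clause, word_count, is_estas_x_kiu):
    """Decide whether a postposed attributive is moved before its head in Chinese.

    is_clause      -- True if the attributive is a clause
    word_count     -- length of the attributive in words
    is_estas_x_kiu -- True for the emphatic pattern ESTAS + X, KIU
    """
    if is_clause and (is_estas_x_kiu or word_count >= 15):
        return False   # the two exceptions: the clause stays where it is
    return True        # every other postposed attributive is fronted


print(should_front_attributive(True, 6, False))
print(should_front_attributive(True, 18, False))
```

The ordering of sister attributives, which the rule ties to semantic features, is outside the scope of this sketch.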

 

 

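【Related】

Master's thesis: An Experiment in Automatic Translation from Esperanto into Chinese and English
Liwei's Master's Thesis: 1. Overview of EChA
Liwei's Master's Thesis: 2. Esperanto: Linguistic Characteristics and Research Value
Liwei's Master's Thesis: 3. The Hierarchical Recursive Constituent System
Liwei's Master's Thesis: 4. The EChA Machine Dictionary and Word Lists
Liwei's Master's Thesis: 5. Esperanto Morphological Analysis
Liwei's Master's Thesis: 6/7. Esperanto Syntactic Analysis
Liwei's Master's Thesis: 8. English Morphological Generation
Liwei's Master's Thesis: 9. Target Language Reordering
Liwei's Master's Thesis: 10. Analysis of the EChA Experimental Results
Liwei's Master's Thesis: [Acknowledgements] [Bibliography]
Liwei's Master's Thesis: Full Text (Esperanto Version)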

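《朝华午拾: shijie, Anecdotes of a Junior Schoolmate (3): Crazy about Esperanto》

Inspiration As If Heaven-Sent, Craft Surpassing Nature Itself

《Liwei's Essays: Learn Esperanto Grammar in One Hour》

Liwei's Esperanto article (1987): 《China Report (中国报道): The Tower of Babel Will Surely Be Built》

Liwei's Esperanto paper (1986): 《Automatic Translation from the International Language into Chinese and English》

Liwei (1988): 《World Science and Technology (世界科技): An Experiment in Automatic Translation from Esperanto into Chinese and English》

Background on the DLT Project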

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

【On Machine Translation】

【Pinned: Index of Liwei's NLP Blog Posts】

《朝华午拾》 Table of Contents