S. Bai: Fight for New Portals

Author: Bai Shuo

Recently, Amazon’s AI product Echo and its voice assistant Alexa set off a whirlwind in the industry.  It has drawn attention from not only the smart home industry but also the AI start-ups as well as the IT giants.  So, what exactly is unique about Alexa?

Recently, Amazon’s AI product Echo and its voice assistant Alexa set off a whirlwind in the industry.  It has drawn attention from not only the smart home industry but also the AI start-ups as well as the IT giants.  So, what exactly is unique about Alexa?

Some people say that Alexa has solved the challenging “cocktail party” problem in speech recognition: imagine a noisy cocktail party, where a person is chatting with you, the voice is not loud, but you can accurately capture the speech with no problem while ignoring the surrounding big noise. Alexa models this amazing human capability well, which is said to be missing from other leading speech players, including the global speech leader USTC iFLYTEK Co.

Others say that behind Alexa are very rich cross-domain know-hows:  one can ask Alexa for on-demand programs, he can also buy goods and services through it; it can be instructed to control the various appliances of our home, or inquire about all kinds of news.  All in all, this is a voice assistant backed by a strong service (with some resources local, and more in the cloud).  Apple’s Siri or Microsoft’s Little Ice are believed to be by no means a match for Alexa in terms of these comprehensive capabilities.

The excellent performance by the end device, coupled with the huge cloud resources in support of the end, constitute Alexa’s expected success in customers’ stickiness, leading to its legendary value as an information portal for a family.  That seems to be a good reason for Alexa’s impressive market performance in the US.  A considerable number of people seem to realize that this may represent a huge business opportunity, one that simply cannot be missed without regret.  Although in other markets beyond the United States, Alexa’s performance is not as eye-catching as in the US market, this Alexa whirlwind has till been scraping the world, leading to the industry’s greatest buzz and triggering a long list of smart speaker simulation shows.

Hence the questions: What are the effects of this invention of Alexa? Who will be affected or even replaced?  How to evaluate Alexa’s portal value? Where is it going as we look into the yesterday, today and tomorrow of this trend?

We may wish to reflect a bit on the development of portals in the IT industry history.  The so-called “portal” is an entry point or interface for an information network of large data flow, connecting consumers and services.  From the model perspective, we have experienced the “web portal” model, the “search engine” model and more recently, the “social network” model, with the on-going trend pointing to a portal moving in the “artificial intelligence” mode. From the carrier perspective, the carrier for the”web portal” and “search engine” models is basically a PC while the “social network” model carrier is mainly a smart phone-based end equipment. Does the “artificial intelligence” model have the potential to change the carrier? In other words, is it possible for the Echo-Alexa hardware-software combination, under the banner of artificial intelligence, to win the portal from the smart phone as the select point of human-machine interface?

I don’t think it is possible.  There are three reasons.

First, the scene is wrong. Even if Alexa is powerful with unique anti-noise ability and the skills of tracking specific people’s speech, since its location is fixed, it is a huge regression from today’s well-developed mobile scenes.  Just think about it, the biggest feature of a family scene is two or more individuals involved in it.  A family is a small society with an innate structure.  Who has the right to issue voice commands? Who has the authority to deny or revoke the voice commands that others have already issued? What happens if the authoritative person is not at home or keeps silent? What if a family member intends to send a private voice instruction? To my mind, voice instruction as a human-machine interaction vehicle by nature involves behaviors of an individual, rather than of a family, with privacy as a basic need in this setting.  Therefore, the family voice portal scene, where Alexa is now set, is likely to be a contradiction. The more voice commands that are parsed and understood, the less will be the proportion of the voice commands that take the home scenes as a necessary condition.

Second, the “horizontal” mode of portal faces the “vertical” resistance.  Even if we agree that the “smart home central control” is a portal of access to end users that cannot be missed by any players, smart speakers like Alexa are also facing challenges from other types of end equipment.  There are two types of data flow in the smart home environment.  The horizontal mode involves the data flow from different manufacturers of home equipment.  The vertical mode portal gathers data from the same manufacturer’s home equipment.  It can be seen that the “horizontal” effort is bound to face the “vertical” resistance in a life and death struggle.  For example, the smart refrigerator and other smart home equipment manufactured by Haier have no reasons to let go its valuable data and flow it away to the smart speaker manufacturers.

Third, the same struggle also comes from other competitions for the “horizontal” line of equipment, including house robots, home gateway / intelligent routers, smart TVs, intelligent pendants and so on.  The advantage of the house robots is that their locations need not be fixed in one place, the advantage of the home gateway is that  it always stays on, the TVs’ advantage lies in their big screens, and intelligent pendants (such as picture frames, sculptures, watches, scales, etc.) have their respective advantage in being small.  In my opinion, smart speakers face all these “horizontal” competitions and there does not seem to be much of a chance in winning this competition.

In summary, the Echo-Alexa’s success comes with a strong superposition characteristic. It is essentially a success of the Amazon business system, rather than the success of smart home appliances or the voice assistant technology. Ignoring the role of its supporting business system, we are likely to overestimate the value of the family information portal, and by simply mimicking or following the smart speaker technology, there is no way out.  Personally, I feel that the smart phone as the carrier of an entry point of information in the mobile Internet era still cannot be replaced.

Is the era of voice interaction really coming?

One important reason for the IT giants to look up to Alexa is that the voice interaction represented by Alexa perhaps opens a new paradigm of human-computer interaction.  Looking back in history, the rise of the click-mode and the rise of the touch-mode have both triggered a revolutionary paradigm shift for human-computer interaction, directly determining the rise and fall of the IT giants. The click-mode led to the rise of Wintel, the touch mode enabled Apple to subvert Wintel: we have witnessed all these changes with our own eyes.  So if the voice interaction really represents the next generation paradigm for human-computer interaction, then Alexa has a special meaning as the precursor of the human-computer interaction paradigm shift.  The giants simply cannot overlook such a shift and its potential revolutionary impact.

However, personally, I do not think that the speech interaction alone carries the weight for an “intergenerational revolution” for human-machine interaction.   There are three reasons to support this.

First, the speech itself does not constitute a complete human-computer interaction scene.  People’s information intake, more than 80% of times, involves the visual information.  When speaking, we often take some visual information as basic context, through the use of a pronoun to refer to it.  For example, pointing to a book on the screen, one may say, “I want to buy this.” In other words, a considerable part of the context in which the speech is delivered comes from the visual presentation, ranging from gestures, touches or eye movements that target some visual objects. This at least shows that we need multi-modal human-computer interaction, rather than using voice alone to replace other human-computer interaction vehicles.

Second, the current speech recognition still cannot handle the dialect well.  China is a big country with a variety of dialects.  Not only dialects, but also the people in dialect areas speack Mandarin with a strong accent. To benefit more than half of the total population in the dialect areas, the speech technology still needs to go through a stage of further development and maturity.

Third, the current speech recognition still has difficulty in solving the “escape” problem. The so-called escape problem involves the identification of scenarios when the speech refers to itself.  When people find there is an error in the first utterance and there is a need to correct it, they may choose to use the next sentence to correct the previous sentence, then this new sentence is not part of the naturally continuous speech commands, hence the need for “being escaped”.  But it is also possible that the latter sentence should not be escaped, and it is a sentence conjoined with the previous sentence, then it is part of the normal speech stream.  This “escape” identification to distinguish different levels of speech referents calls for more advanced semantic analysis technology, which is not yet mature.

So, considering the current level of speech technology, it seems too early to talk about the “intergenerational revolution”.  Furthermore, speech may well be just one factor, and not necessarily a disruptive one.  It seems more reasonable to state that the future of human-computer interaction may enter an era of multi-modal input, rather than speech alone.

The semantic grounding is the key to the stickiness of users.

Semantics as a term seems abused in all kinds of interpretations.  Some even think that once words are identified, semantics is there, which is far from true. The semantics of natural languages is very deep and involves a lot.  I mean a lot!

From the academic point of view, semantics is divided into two parts.  One called “symbol grounding”, which is about the relationship of the language symbol (signifier) and its referent to the real world entity (including the conceptual world).  The second is called “role assignment”, which is about the relationship between the referents of the language symbols in the reality.  Siri is the pioneer in the mobile semantic grounding realized in the domain apps such as Address, Map and Weather.  The past few years have seen the scope of semantic grounding grow wider and wider.

Let me review what I said before: “the excellent performance by the end equipment, coupled with the huge cloud resources in support of the end, constitute the Alexa’s expected success in users’ stickiness”.  We can further explore along this line in this section.  Between “the performance by the end equipment” and “the cloud resources in support of the end”, which is the root cause for Alexa’s stickiness with the customers?  I do not intend to play the trick of dialectical balance by saying something like both are important and no one can do the job without the other.  That is always true but cheap, and it gives no actionable insights. The consequence includes possible blind investments in both for the copycat, such investments may well lead to a complete failure in the market.

The author argues that “the performance by the end equipment” is about the adaptability of the hardware to the scene.  This is at best about a “good live experience” of users. But a product with “good user experience” without real content will soon degrade to a toy, and they cannot even count as high-end toys.  If there is no real “meaningful service” associated, there will be no sustainable stickiness of customers. Without user stickiness, they cannot become sustainable data collection entry points as a data flow portal.  However, any associated “meaningful services” must come from the semantic grounding, that is, the connection from a speech command with its corresponding actual service.  This is the essence behind Alexa’s so-called “know-hows.”  Semantic grounding as mentioned hereafter all refers to such connection from the speech command with infinitely possible actual service resources.

Comprehensive semantic grounding requires a strong open-domain NLP engine. Service resources are so diverse in tens of thousands, and they can hardly be confined to one or only a few narrow domains.  An NLP engine functioning only in a narrow domain cannot do this job well.  To work in the open domain requires an engine to be equipped with extraordinary capacity in the semantic analysis, and it must be on the right path in the semantic knowledge representation and processing.  In this regard, even if an English engine is doing decently well, it does not necessarily mean the Chinese counterpart will work well.  For those who do not yet understand the difficulty and pain points of the Chinese NLP engine in the open domain, it is hardly possible to expect them to achieve large-scale semantic grounding effects. Such technology barriers can set apart a huge gap in products attempting to do the same thing in the market between companies equipped with or without deep semantic capabilities.

Semantic grounding requires an engineering adaptation at the interface to the service resources.  This is also a very difficult task, and it involves competitions in the scale of resources as well as efficiency and management. Start-up companies can hardly have such a resource integration capacity and the engineering organization capabilities, these are the strength of large companies. Some people say that I can start small and gradually scale up, okay? I said, no, time does not wait for people.  In the area of semantic grounding, if products are not developed in a relatively short time to capture the market, there are little chances for survival.

Semantic grounding also calls for the ability to manage the man-machine interactive scene itself. This involves a variety of technologies such as contextual perception, topic switching, sentiment analysis, language style selection, personality shaping and many others. A speech assistant is not necessarily the best if it only mimics human’s eloquence or seemingly likable ways of expressions. Skills such as moderate profoundness or sharpness in arguments and even some rudeness at times can all be selling points as an intelligent assistant.

Therefore, we would point out the key role of semantic grounding on the stickiness of Alexa users, emphasizing the decisive contribution of large service resources behind Alexa’s success story.  In China, if Chinese IT giants with a comparable size of the Amazon service resources do not take the lead, coupled by a solid open domain Chinese NLP engine with a star team, the speech technology alone has no way to generate such a user stickiness as we see in Alexa.

Who will win then?

In essence, it is all about gathering the user data by the end equipments.  Smartphones dominate the industry for years, all kinds of smart home solutions across the verticals have also been fighting for several years now.  Alexa’s coming to the market stirs the industry with a lot of excitement and revelations, but it is far from what is all set.  We still have opportunities.  But keep in mind, it cannot be overemphasized to look into issues involving the combination of the end devices with the cloud and the combination between the entry point and the entry point carrier to form a closed-loop data stream.  If we lose the sense of directions and trends in these issues, the opportunity will not be ours.

So what is the direction and what are the trends? Let me give an analysis.

First, artificial intelligence is bound to be the next generation portal. In other words, all kinds of service needs will inevitably go from the end devices to the cloud through the artificial intelligence multi-channel input analysis, leveraging the human-computer interaction advantages.  The variety of service resources will eventually use the knowledge of artificial intelligence and cognitive decision-making ability, to provide to users from the cloud to the end. If you do not lay out a roadmap in developing artificial intelligence, the future portal is definitely not yours.

Second, the smartphone for a long time to come will stay as defacto chief carrier. Wherever is the person going, the communication node and the digital identity will follow and the perception of the life scene and the app as the service agent will also follow. There are no other end devices that match the smartphone on the most critical dimensions of the individualness, privacy, and the ubiquitous nature as needed by a portal carrier.

Third, there will be separation between the communication function of a terminal device and the demanded service function. As the service grows more and more diversified, it becomes impossible for one end device to handle all types of service needs.  But it is not desirable for each end device to come with its own communication function.  The relationship between Apple Watch and iPhone is intriguing in this regard: iPhone serves as the communication hub as well as the client information processing hub while Apple Watch functions as a special device for information collection and limited information display.  They are connected through a “near field communication” link.  Of course, both are Apple’s products in one family, the data flow is therefore under a unified control.  In such a setting, they are tightly coupled, and the separation is always limited. However, this mode sheds lights to the future when all kinds of separation may be required but they should also be connected in some way.  If the mobile phone manufacturers keep an open mind, they can use the block chain technology in data collection with a variety of ancillary equipment to make an objective record of the respective contributions and accordingly make reasonable arrangements with regards to the data and proceeds sharing. A loose coupling of the separation will then evolve and mature, promoting the rapid ecological development of end devices in all kinds of forms. It is imaginable that, when we are in a new place, we can take out from our pocket a soft thin foldable electronic map.  This map, when unfolded, looks as big as a real paper map, but it works conveniently just like a mobile map app: it responds to the touch operations and may even accommodate speech instructions to associate with our phone. Of course, this map can also simply be a virtual projection, not necessarily taking the form of a real object.  Our phone only needs to take care of communication, all the control and display are accomplished on the map, and we do not even need to physically take out the phone. Such a phone may never need to be held in hands, we may even wear the phone on the foot, and the hand mobile device gradually evolves into a “foot phone” … …

Are you ready for the opportunity and inspirations brought by the Alexa whirlwind?

Translated by: Dr. Wei Li based on GNMT


S. Bai: Natural Language Caterpillar Breaks through Chomsky’s Castle

Dr Wei Li’s English blogs




Trap of Information Overdose

Today, my topic relates to the issue of information overload.

We are born in the era of big data and information overload. As an NLPer (Natural Language Processor), for years I have been stuck in the belief that my sole mission is to help solve this problem of information overload. Just like Alibaba’s Jack Ma’s vision that there should be no barriers for any business in this e-commerce world, my colleagues and I seem to share the vision in the community that there should be no barriers for instant access to any information amid the big data. So Google appeared, with crude keywords as basis and with its insatiable appetite to cover as big data as possible, to  have solved the problem of information long tail. Today whatever your query, and however rare your information need is, you google it and you get some relevant info back. We don’t want to stop there, so we begin to criticize Google because its solution to the information on the long tail has the defect of poor data quality. Hence AI (Artificial Intelligence) is proposed and being practiced to enhance the deep processing of data (whether via deep learning or deep parsing), in an attempt to both handle big data for its long tail, as well as to drastically raise the data quality through natural language understanding (NLU). The aim is to satisfy any soul with information needs, whether explicitly requesting it or implicitly carried in the mind, by a steady flow of quality information. This is the perspective from us practitioners’ point of view, currently mixed with lots of excitement and optimism.

Let us change our perspective to ask ourselves, as a consumer, what have we benefited from this exciting AI battle on information overload? Indeed, what we now get is more and more data — to the point, high-quality, with constant and instant feeds, which we have never before been able to reach. Previously we were drowned in the overload of the information ocean, mostly garbage occasionally with a few pearls, and nowadays we end up being choked to death by over-satisfaction of quality information thanks to the incredible progress of information high-tech via AI. So the feelings are dramatically different, but the ending remains the same, both are an inescapable path to death, drowned or choked. So each day we spend more and more time in the social media among our circles of friends, on all types of news apps, or entertainment apps, with less and less time for real-life work, family and serious thinking. Numerous geniuses out there (many are my talented peers) racked their brains to study our preferences, study how to make us stick to their apps, and what tricks they can apply to drive us crazy and addicted to their products.

It is the iron law that a person is no match for a calculated and dedicated world. Made of flesh and blood, each individual consumer is no match for an invisible legion of tech gurus (including myself) from businesses and their accomplices in the information industry, looking closely into our behavior and desires. So we are bound to sink to the bottom, and eventually become a slave of information. Some of us begin to see through this trap of information overdose, struggling hard to fight the addiction, and seeking self-salvation against the trend. Nevertheless, with the rapid progress of artificial intelligence and natural language technology, we see the trend clear, unstoppable and horrifying: more and more are trapped in the info, and those who can save themselves with a strong will are a definite minority.

The world has n billion people, and m million organizations, each producing information non-stop every moment, which is now recorded one way or the other (e.g. in social media). Even if we raise our bar higher and higher for our information needs for work and for pleasure, to the extent of an incredible ratio to the effect of something like ten-millionth, using a variety of technology filters of information, we are still faced with info feeds from n-hundred human entities and m-organizations. There is simply no way in our lifetime to exhaust it all and catch up with its feeds. We end up feeling over-satisfied with information most of which we feel we simply cannot and should not miss. We are living in a terrible bliss of an over-satisfying world. As consumers we are doomed in this battle to fight the addiction against our own nature, trying to resist the temptation that by nature cannot be resisted.

Having pointed out the problem, I have no effective remedy to this problem to offer. What I myself do is that at times, I simply shut down the channels to stay in info-diet or hungry mode, focusing on family and the accumulated to-do list of work. This seems to work and I often got my work done, without feeling I have missed that much for the information gap during the “diet” period, but it is not a sustainable technique (with exception perhaps of very few super guys I know whom I admire but really cannot tell whether that lifestyle is really for better or not as shutting the info channels for too long has its own side effects, or consequences, to my mind). In the end, most of us fall back to being willing slaves of information. The smarter minds among us have learned to shift between these two modes: shutting channels down for some time and going back to the “normal” modern way of information life.

For people who want and need to kill time, for example, the retired in the lonely senior homes, info age is God-sent: their quality of killing time has never been made better. But how about the younger generation who is most vulnerable to info overdose, as much as the addiction to the crazily popular games today. The “shutting the channels” technique is a survival skill of middle-aged generation who needs to dedicate sufficient time to go about their daily work and life, making a living, supporting the family and keeping it running. But this technique is almost impossible for the young generation to practice, given that they are born in this info age, and social media and stuff are part of their basic lifestyle. Nevertheless, there is no short of struggles and helplessness as we observe when they are being drowned in the sea of games, social media and Internet, in front of the academic pressure and career training competition. The external world is not in the least prepared and is basically helpless to them. So are us parents. Many times we cannot resist the temptation from being enslaved in the information trap for ourselves, how can we expect our next generation to learn the balancing skill easily, considering they are at the age of exploration with tremendous curiosity and confusion.

Sometimes I tell myself: why should we work so hard on info technology if we know it has both positive effects as well as huge negative impact which we have no clues how to fix. After all, we do not need to rush the entire world of life and time to be engulfed by info no matter how high quality we can make it to be. Meanwhile, I really hope to see more and more study to get invested in addressing how to help people resist the temptation of the information trap. The ideal world in my understanding should be that we stay equipped with both intelligent tools to help access quality information as nutrients to enrich our lives, as well as tools to help resist the temptation from info over-satisfaction.

Translated and recompiled from the original post in my Chinese blog: 【杞人忧天:可怕的信息极乐世界




Dr Li’s NLP Blog in English


Small talk with Daughter on US Election

just had a small talk with Tanya on US election, she was super angry and there was a big demonstration against Trump in her school too

I don’t want him to win
I don’t want him to do well
Or else another racist gets electedMe:

neither did I
IF he does very badly, he will be impeached;
or at least he will not be reelected in 4 years.
But now that he is, we can keep an open mind.
There is an element of sentiment he is representing: so-called silent majority, that is why most polls were wrong.

By the way, many have praised my social media analysis just before the election, mine was way better than all the popular polls such as CNN.  This is not by accident, this is power of big data and high tech in the information age:

Final Update of Social Media Sentiment Statistics Before Election

with deep NLP and social media, we can pick up sentiments way more reliable and statistical than the traditional polls, which usually only call 500 to 1000 for opinions to hope they represent 200 million voters.  My mining and analysis are based on millions and millions of data points.  So in future we have to utilize and bring the automatic NLP into things like this as one important indicator of insights and public opinions and sentiments

So in future, we have to utilize and bring NLP into things like this as one important indicator of insights and public opinions and sentiments.

you’re amazing
Your technology is amazing

I got lots of compliments for that, but yours mean the most to me.

What happened in the election as I had been tracking using our NLP sentiment tool was:

1. Clinton was clearly leading in the period after the recording scandal of Trump and before the FBI started reopening Clinton’s email case: Big data mining shows clear social rating decline of Trump last month.

2. Clinton has always been leading in Spanish speaking communities and media, but that did not seem to be sufficient to help revert the case:  Trump sucks in social media big data in Spanish.

3. The event of FBI re-opening the email investigation gave Clinton the most damage: Trump’s scandal was cooling down and the attention was all drawn to Clinton’s email case so that the sentiment has a sharp drop for Clinton (【社煤挖掘:大数据告诉我们,希拉里选情告急】)

4. When FBI finally reissued a statement that there was no evidence to charge Clinton only 2 days before the election, time was too short to remedy the damage FBI did in their first event of reopening the case: my big data tracking found that there was some help but not as significant (【大数据跟踪美大选每日更新,希拉里成功反击,拉川普下水】).

5. Then just before the election, I did a final update of the big data sentiment tracking for the last 24 hours versus last 3 months, and found that Trump had a clear leading status in public opinion and sentiments, so I decided to let the world know it although at the point most everyone believed that Clinton was almost sure to win.

Oh my god dad your machine is the smartest tracker on the market
Dad your system is genius
This is exactly what media needs
You should start your own company
This is amazing
I think this would be the planets smartest machine

I do not disagree, :=)It was a tight competition and with good skills, things could turn different in result.  In terms of popularity votes, they are too to be statistically different, so anything at the right timing could have changed the result.

It was in fact a tight competition and with good skills, things could turn different in result.  In terms of popularity votes, they are too to be statistically different, so anything at the right timing could have changed the result.

On retrospect, FBI did a terrible thing to mess up with the election:
they reopened a case which they did not know the results
just 10 days before the election which made a huge difference.
On the other hand, the recording scandal was released too early
so that although it hurt Trump severely at the time, yet it allowed FBI to revert the attention to Clinton

In future, there should be a strict law disallowing a government agency
which is neutral politically by nature to mess up with an election within a time frame, so Trump’s winning the case to my mind has 80%+ credit from the FBI events.
What a shame





Big data mining shows clear social rating decline of Trump last month

Clinton, 5 years ago. How time flies …








Pulse:实时舆情追踪美国大选,live feed,real time!


Clinton has been mostly leading the social media sentiment :

Screenshots at 4:50pm 11/8/2016:






Again go check our website live on Pulse:






Big data mining shows clear social rating decline of Trump last month

Clinton, 5 years ago. How time flies …









Final Update of Social Media Sentiment Statistics Before Election

Final update before election:


Net sentiment last 24 hours: Trump +7 ; Clinton -9.  The last day analysis of social media.  Buzz:

So contrary to the popular belief, Trump actually is leading in social media just before the election day.

Compare the above with last month ups and downs to put it in larger context:

Last 3 month sentiment: Trump -11; Clinton -18.
Buzz for Trump never fails:


Trump’s Word Clouds:









Clinton’s Word Clouds:



Trump 3-month summary:


Clinton 3-month summary:




理论上讲,只要有一方得到所有选票的23%, 他或她就可能当选





Big data mining shows clear social rating decline of Trump last month

Clinton, 5 years ago. How time flies …








Trump sucks in social media big data in Spanish

As promised, let us get down to the business of big data mining of public opinions and sentiments from Spanish social media on the US election campaign.

We know that in the automated mining of public opinions and sentiments for Trump and Clinton we did before, Spanish-Americans are severely under-represented, with only 8% Hispanic posters in comparison with their 16% in population according to 2010 census (widely believed to be more than 16% today), perhaps because of language and/or cultural barriers.  So we decide to use our multilingual mining tools to do a similar automated survey from Spanish Social Media to complement our earlier studies.

This is Trump as represented in Spanish social media for the last 30 days (09/29-10/29), the key is his social rating as reflected by his net sentiment -33% (in comparison with his rating of -9% in English social media for the same period): way below the freezing point, it really sucks, as also illustrated by the concentration of negative Spanish expressions (red-font) in his word cloud visualization.

By the net sentiment -33%, it corresponds to 242,672 negative mentions vs. 121,584 positive mentions, as shown below. In other words, negative comments are about twice as much as positive comments on Trump in Spanish social media in the last 30 days.

This is the buzz in the last 30 days for Trump: mentions and potential impressions (eye balls): millions of data points and indeed a very hot topic in the social media.

This is the BPI (Brand Passion Index) graph for directly comparing Trump and Clinton for their social ratings in the Spanish social media in the last 30 days:

As seen, there is simply no comparison: to refresh our memory, let us contrast it with the BPI comparison in the English social media:

Earlier in one of my election campaign mining posts on Chinese data, I said, if Chinese only were to vote, Trump would fail horribly, as shown by the big margin in the leading position of Clinton over Trump:

This is even more true based on social media big data from Spanish.

This is the comparison trends of passion intensity between Trump and Clinton:

The visualization by weeks of the same passion intensity data, instead of by days, show even more clearly that people are very passionate about both candidates in the Spanish social media discussions, the intensity of sentiment expressed for Clinton are slightly higher than for Trump:

This is the trends graph for their respective net sentiment, showing their social images in Spanish-speaking communities:

We already know that there is simply no comparison: in this 30-day duration, even when Clinton dropped to its lowest point (close to zero) on Oct 9th, she was still way ahead of Trump whose net sentiment at the time was -40%. In any other time segments, we see an even bigger margin (as big as 40 to 80 points in gap) between the two. Clinton has consistently been leading.

In terms of buzz, Trump generates more noise (mentions) than Clinton consistently, although the gap is not as large as that in English social media:

This is the geo graph, so the social data come from mostly the US and Mexico, some from other Latin America countries and Spain:

Since only the Mexicans in the US may have the voting power, we should exclude media from outside the US to have a clearer picture of how the Spanish-speaking voters may have an impact on this election. Before we do that filtering, we note the fact that Trump sucks in the minds of Mexican people, which is no surprise at all given his irresponsible comments about the Mexican people.

Our social media tool is equipped with geo-filtering capabilities: you can add a geo-fence to a topic to retrieve all social media posts authored from within a fenced location. This allows you to analyze location-based content irrespective of post text. That is exactly what we need in order to do a study for Spanish-speaking communities in the US who are likely to be voters, excluding those media from Mexico or other Spanish-speaking countries. communities in the US who are likely to be voters, excluding those media from Mexico or other countries. This is also needed when we need to do study for those critical swing states to see the true pictures of the likelihood of the public sentiments and opinions in those states that will decide the destiny of the candidates and the future of the US (stay tuned, swing states social media mining will come shortly thanks to our fully automated mining system based on natural language deep parsing).

Now I have excluded Spanish data from outside America, it turned out that the social ratings are roughly the same as before: the reduction of the data does not change the general public opinions from Spanish communities, US or beyond US., US or beyond US. This is US only Spanish social media:

This is summary of Trump for Spanish data within US:

It is clear that Trump’s image truly sucks in the Spanish-speaking communities in the US, communities in the US, which is no surprise and so natural and evident that we simply just confirm and verify that with big data and high-tech now.

These are sentiment drivers (i.e. pros and cons as well as emotion expressions) of Trump :

We might need Google Translate to interpret them but the color coding remains universal: red is for negative comments and green is positive. More red than green means a poor image or social rating.

In contrast, the Clinton’s word clouds involve way more green than red: showing her support rate remains high in the Spanish-speaking communities of the US.

It looks like that the emotional sentiments for Clinton are not as good as Clinton’s sentiment drivers for her pros and cons.

Sources of this study:

Domains of this study:


Did Trump’s Gettysburg speech enable the support rate to soar as claimed?

Big data mining shows clear social rating decline of Trump last month

Clinton, 5 years ago. How time flies …

Automated Suevey

Dr Li’s NLP Blog in English

Did Trump’s Gettysburg speech enable the support rate to soar as claimed?

Last few days have seen tons of reports on Trump’s Gettysburg speech and its impact on his support rate, which is claimed by some of his campaign media to soar due to this powerful speech.  We would love to verify this and uncover the true picture based on big data mining from the social media.

First, here is one link on his speech:

DONALD J. TRUMP DELIVERS GROUNDBREAKING CONTRACT FOR THE AMERICAN VOTER IN GETTYSBURG. (The most widely circulated related post in Chinese social media seems to be this: Trump’s heavyweight speech enables the soaring of the support rate and possible stock market crash).

Believed to be a historical speech in his last dash in the campaign, Trump basically said: I am willing to have a contract with the American people on reforming the politics and making America great again, with this plan outline of my administration in the time frame I promised when I am in office, I will make things happen, believe me.

Trump made the speech on the 22nd this month, in order to mine true public opinions of the speech impact, we can investigate the data around 22nd for the social media automated data analysis.  We believe that automated polling based on big data and language understanding technology is much more revealing and dependable than the traditional manual polls, with phone calls to something like 500 to 1,000 people.  The latter is laughably lacking sufficient data to be trustworthy.


What does the above trend graph tell us?

1  Trump in this time interval was indeed on the rise. The “soaring” claim this time does not entirely come out of nowhere, but, there is a big BUT.

2. BUT, a careful look at the public opinions represented by net sentiment (a measure reflecting the ratio of positive mentions over negative mentions in social media) shows that Trump has basically stayed below the freezing point (i.e. more negative than positive) in this time interval, with only a brief rise above the zero point near the 22nd speech, and soon went down underwater again.

3. The soaring claim cannot withstand scrutiny at all as soaring implies a sharp rise of support after the speech event in comparison with before, which is not the case.

4. The fact is, Uncle Trump’s social media image dropped to the bottom on the 18th (with net sentiment of -20%) of this month.  From 18th to 22nd when he delivered the speech, his net sentiment was steadily on rise from -20% to 0), but  from 22nd to 25th, it no longer went up, but fell back down, so there is no ground for the claim of support soaring as an effect of his speech, not at all.

5. Although not soaring, Uncle Trump’s speech did not lead to sharp drop either, in terms of the buzz generated, this speech can be said to be fairly well delivered in his performance. After the speech, the net sentiment of public opinions slightly dropped, basically maintaining the fundamentals close to zero.

6.  The above big data investigation shows that the media campaign can be very misleading against the objective evidence and real life data.  This is all propaganda, which cannot be trusted at its face value: from so-called “support rate soared” to “possible stock market crash”. Basically nonsense or noise of campaign, and it cannot be taken seriously.

The following figure is a summary of the surveyed interval:


As seen, the average public opinion net-sentiment for this interval is -9%, with positive rating consisting of 2.7 million mentions, and negative rating of 3.2 million mentions.

How do we interpret -9% as an indicator of public opinions and sentiments? According to our previous numerous automated surveys of political figures, this is certainly not a good public opinion rating, but not particularly bad either as we have seen worse.  Basically, -9% is under the average line among politicians reflecting the public image in people’s minds in the social media.  Nevertheless, compared with Trump’s own public ratings before, there is a recorded 13 points jump in this interval, which is pretty good for him and his campaign.  But the progress is clearly not the effect of his speech.

This is the social media statistics on the data sources of this investigation:


In terms of the ratio, Twitter ranks no 1, it is the most dynamic social media on politics for sure, with the largest amount of tweets generated every minute. Among a total of 34.5 million mentions on Trump, Twitter accounted for 23.9 million.  In comparison, Facebook has 1.7 million mentions.

Well, let’s zoom in on the last 30 days instead of only the days around the speech, to provide a bigger background for uncovering the overall trends of this political fight in the 2016 US presidential campaign between Trump and Clinton.


The 30 days range from 9/28-10/28, during which the two lines in the comparison trends chart show the contrast of Trump and Clinton in their respective daily ups and downs of net sentiment (reflecting their social rating trends).  The general impression is that the fight seems to be fairly tight.  Both are so scandal-ridden, both are tough and belligerent.  And both are fairly poor in social ratings.  The trends might look a bit clearer if we visualize the trends data by weeks instead of by day:


No matter how much I dislike Trump, and regardless of my dislike of Clinton whom I have decided to vote anyway in order to make sure the annoying Trump is out of the race,  as a data scientist, I have to rely on data which says that Hillary’s recent situation is not too optimistic: Trump actually at times went a little ahead of Clinton (a troubling fact to recognize and see).


The graph above shows a comparison of the mentions (buzz, so to speak).  In terms of buzz, Trump is a natural topic-king, having generated most noise and comments, good or bad.  Clinton is no comparison in this regard.


The above is a comparison of public opinion passion intensity: like/love or dislike/hate?  The passion intensity for Trump is really high, showing that he has some crazy fans and/or deep haters in the people.  Hillary Clinton has been controversial also and it is not rare that we come across people with very intensified sentiments towards her too.  But still, Trump is sort of political anomaly, and he is more likely to cause fanaticism or controversy than his opponent Hillary.

In his recent Gettysburg speech, Trump highlighted the so-called danger of the election being manipulated. He clearly exaggerated the procedure risks, more than past candidates in history using the same election protocol and mechanism.  By doing so, he paved the way for future non-recognition of the election results. He was even fooling the entire nation by saying publicly nonsense like he would totally accept the election results if he wins: this is not humor or sense of humor, it depicts a dangerous political figure with ambition unchecked.  A very troubling sign and fairly dirty political tricks or fire he is playing with now, to my mind.  Now the situation is, if Clinton has a substantial lead to beat him by a large margin, this old Uncle Trump would have no excuse or room for instigating incidents after the election.  But if it is closer to see-saw, which is not unlikely given the trends analysis we have shown above, then our country might be in some trouble: Uncle Trump and his die-hard fans most certainly will make some trouble.  Given the seriousness of this situation and pressing risks of political turmoil possibly to follow,  we now see quite some people, including some conservative minds, begin to call for the election of Hillary for the sake of preventing Trump from possible trouble making.  I am one with that mind-set too, given that I do not like Hillary either.  If not for Trump, in ordinary elections like this when I do not like candidates of both major parties, I would most likely vote for a third party, or abstain from voting, but this election is different, it is too dangerous as it stands.  It is like a time bomb hidden somewhere in the Trump’s house, totally unpredictable. In order to prevent him from spilling, it is safer to vote for Clinton.

In comparison with my earlier automated sentiment analysis blogged about a week ago (Big data mining shows clear social rating decline of Trump last month),this updated, more recent BPI brand comparison chart seems to be more see-saw: Clinton’s recent campaign seems to be stuck somewhere.


Over the last 30 days, Clinton’s net sentiment rating is -17%, while Trump’s is -19%.  Clinton is only slightly ahead of Trump.  Fortunately, Trump’s speech did not really reverse the gap between the two, which is seen fairly clearly from the following historical trends represented by three different circles in brand comparison (the darker circle represents more recent data): the general trends of Clinton are still there: it started lagging behind and went better and now is a bit stuck, but still leading.



Yes, Clinton’s most recent campaign activities are not making significant progress, despite more resources put to use as shown by bigger darker circle in the graph.  Among the three circles of Clinton, we can see that the smallest and lightest circle stands for the first 10 days of data in the past 30 days, starting obviously behind Trump.  The last two circles are data of the last 20 days, seemingly in situ, although the circle becomes larger, indicating more campaign input and more buzz generated.  But the benefits are not so obvious.  On the other side, Trump’s trends show a zigzag, with the overall trends actual declining in the past 30 days.  The middle ten days, there was a clear rise in his social rating, but the last ten days have been going down back.  Look at Trump’s 30-day social cloud of Word Cloud for pros and cons and Word Cloud for emotions:

Let us have a look at Trump’s 30-day social media sentiment word clouds, the first is more about commenting on his pros and cons, and the second is more direct and emotional expressions on him:sentiment-drivers-38

One friend took a glance at the red font expression “fuck”, and asked: who are subjects and objects of “fuck” here?  In fact, the subject generally does not appear in the social posts, by default it is the poster himself, reflecting part of the general public, the object of “fuck” is, of course, Trump, for otherwise our deep linguistics based system will not count it as a negative mention of trump reflected in the graph.  Let us show some random samples side by side of the graph:


My goodness, the “fuck” mentions account for 5% of the emotional data, the poor old Uncle Trump is fucked 40 million times in social media within one-month duration, showing how this guy is hated by some of the people whom he is supposed to represent and govern if he takes office.   See how they actually express their strong dislike of Trump:

fucking moron
fucking idiot

you name it, to the point even some Republicans also curse him like crazy:

Trump is a fucking idiot. Thank you for ruining the Republican Party you shithead.

Looking at the following figure of popular media, it seems that the most widely circulated political posts in social media involve quite some political video works:


The domains figure below shows that the Tumblr posts on politics contribute more than Facebook:


In terms of demographics background of social media posters, there is a fair balance between male and female: male 52% female 48% (in contrast to Chinese social media where only 25% females are posting political comments on US presidential campaign).  The figure below shows the ethnic background of the posters, with 70% Caucasians, 13% African Americans, 8% Hispanic and 6% Asians.  It looks like that the Hispanic Americans and Asian Americans are under-represented in the English social media in comparison with their due population ratios, as a result, this study may have missed some of their voice (but we have another similar study using Chinese social media, which shows a clear and big lead of Clinton over Trump; given time, we should do another automated survey using our multilingual engine for Spanish social media.  Another suggestion from friends is to do a similar study on swing states because after all these are the key states that will decide the outcome of this election, we can filter the data by locations where posts are from to simulate that study).  There might be a language or cultural reasons for this under-representation.


This last table involves a bit of fun facts of the investigation.  In social media, people tend to talk most about the campaign, on the Wednesday and Sunday evenings, with 9 o’clock as the peak, for example, on the topic of Trump, nine o’clock on Sunday evening generated 1,357,766 messages within one hour.  No wonder there is no shortage of big data from social media on politics.  It is all about big data. In contrast, with the traditional  manual poll, no matter how sampling is done, the limitation in the number of data points is so challenging:
with typically 500 to 1000 phone calls, how can we trust that the poll represents the public opinions of 200 million voters?  They are laughably too sparse in data.  Of course, in the pre-big-data age, there were simply no alternatives to collect public opinion in a timely manner with limited budgets.  This is the beauty of Automatic Survey, which is bound to outperform the manual survey and become the mainstream of polls.


Authors with most followers are:


Most mentioned authors are listed below:


Tell me when in history did we ever have this much data and info, with this powerful data mining capabilities of fully sutomated mining of public opinions and sentiments at scale?




Big data mining shows clear social rating decline of Trump last month

Clinton, 5 years ago. How time flies …

Automated Suevey

Dr Li’s NLP Blog in English



Big data mining shows clear social rating decline of Trump last month

Big data mining from last month’ social media shows clear decline of Trump in comparison with Clinton


Our automatic big data mining for public opinions and sentiments from social media speaks loud and clear: Tump’s social image sucks.

Look at last 30 days of social media on the Hillary and Trump’s social image and standing in our Brand Passion Index (BPI) comparison chart below:


Three points to note:
1 Trump has more than twice buzz than Hillary in terms of social media coverage (the size of the circles indicates the degree of mentions);
2. The intensity of sentiments from the general public of netters is more intense for Chump than for Clinton: the Y-axis shows the passion intensity
3. The social ratings and images of the two are both quite poor, but Trump is more criticized in social: the X-axis of Net Sentiment shows the index social sentiment ratings.  Both are under freezing point (meaning more negative comments than positive).

If we want to automatically investigate the trend of the past month and their social images’ ups and downs, we can have the data segmented into two or three segments.  Figure below shows the trends contrast of the first 15 days of social media data vs. the second 15 days of data in the 30-day period (up to 10/21/2016):


See, in the past month, with the presidential election debates and scandals getting attention, Trump’s media image significantly deteriorated, represented by the public opinion circles shifting from the right on the X-axis to the left side (for dislike or hate sentiments: the lighter circle represents data older than the darker circle).  His social rating was clearly better than Hillary to start with and ended up worse than that of Hillary.  At the same time, Hillary’s social media image has improved, the circle moves a bit from the left to right. Two candidates have always been below the freezing point, clearly shown in the figure, but just a month ago, Clinton was rated even lower than Trump in public opinions of the social media: it is not the people who like Trump that much, but the general public showed more dislike for Hillary for whatever reasons.

As seen, our BPI brand comparison chart attempts to visualize four-dimensional information:
1. net sentiment for social ratings on the X-axis;
2. the passion intensity of public sentiments on the Y-axis;
3. buzz circle size, representing mentions of soundbites;
4. The two circles of the same brands show the coarse-grained time dimension for general trends.

It is not very easy to represent 4 dimensions of analytics in a two-dimensional graph.  Hope the above attempt in our patented visualization efforts is insightful and not confusing.

If we are not happy with the divide-into-two strategy for one month of data to show the trends, how about cut them into three pieces?  Here is the Figure for .three circles in the time dimension.


We should have used different colors for the two political brands to make visualization a bit clearer.  Nevertheless, we see the trends for Clinton in her three circles of social media sentiments shifting from the lower left corner to the upper right in a zigzag path: getting better, then worse, and ended up with somewhere in between at this point (more exactly, up to the point of 10/21/2016). For the same 3 segments of data, Trump’s (brand) image started not bad, then went slightly better, and finally fell into the abyss.

The above is to use our own brand comparison chart (BPI) to decode the two US presidential candidates’ social images change and trends.  This analysis, entirely automated based on deep Natural Language Parsing technology, is supported by data points in a magnitude many times more than the traditional manual polls which are by nature severely restricted in data size and time response.

What are the sources of social media data for the above automated polling?  They are based on random social media sampling of big data, headed by the most dynamic source of Twitter, as shown below.




This is a summary of the public opinions and sentiments:


As seen, it is indeed BIG data: a month of random sampling of social media data involves the mentions of the candidates for nearly 200 million times, a total of up to 3,600+ billion impressions (potential eyeballs). Trump accounted for 70 percent of the buzz while Clinton only 30 percent.

The overall social rating during the period of 09/21/2016 through 10/21/2016, Trump’s net sentiment is minus 20%, and Clinton is minus 18%.  These measures show a rating much lower than that of most other VIP analysis we have done before using the same calculations.  Fairly nasty images, really.   And the big data trends show that Trump sucks most.

The following is some social media soundbites for Trump:

Bill Clinton disgraced the office with the very behavior you find appalling in…
In closing, yes, maybe Trump does suffer from a severe case of CWS.
Instead, in this alternate NY Times universe, Trump’s campaign was falling …
Russian media often praise Trump for his business acumen.
This letter is the reason why Trump is so popular
Trump won
I’m proud of Trump for taking a stand for what’s right.
Kudos to Trump for speaking THE TRUTH!
Trump won
I’m glad I’m too tired to write Trump/Putin fuckfic.
#trump won
Trump is the reason Trump will lose this election.
Trump is blamed for inciting violence.
Breaking that system was the reason people wanted Trump.
I hate Donald Trump for ruining my party.
>>32201754 Trump is literally blamed by Clinton supporters for being too friendly with Russia.
Another heated moment came when Trump delivered an aside in reponse to …
@dka_gannongal I think Donald Trump is a hoax created by the Chinese….
Skeptical_Inquirer The drawing makes Trump look too normal.
I’m proud of Donald Trump for answering that honestly!
Donald grossing me out with his mouth features @smerconish …
Controlling his sniffles seems to have left Trump extraordinarily exhausted
Trump all the way people trump trump trump
Trump wins
Think that posting crap on BB is making Trump look ridiculous.
I was proud of Trump for making America great again tonight.
MIL is FURIOUS at Trump for betraying her!
@realdonaldTrump Trump Cartel Trump Cartel America is already great, thanks to President Obama.
Kudos to Mr Trump for providing the jobs!!
The main reason to vote for Trump is JOBS!
Yes donal trump has angered many of us with his WORDS.
Trump pissed off a lot of Canadians with his wall comments.
Losing this election will make Trump the biggest loser the world has ever seen.
Billy Bush’s career is merely collateral damage caused by Trump’s wrenching ..
So blame Donald for opening that door.
The most important reason I am voting for Trump is Clinton is a crook.
Trump has been criticized for being overly complimentary of Putin.
Kudos to Trump for reaching out to Latinos with some Spanish.
Those statements make Trump’s latest moment even creepier.
I’m mad at FBN for parroting the anti-Trump talking points.
Kudos to Trump for ignoring Barack today @realDonaldTrump
Trump has been criticized for being overly complimentary of Putin.
OT How Donald Trump’s rhetoric has turned his precious brand toxic via …
It’s these kinds of remarks that make Trump supporters look like incredible …
Trump is blamed for inciting ethnic tensions.
Trump is the only reason the GOP is competitive in this race.
Its why Republicans are furious at Trump for saying the voting process is rigged.
Billy Bush’s career is merely collateral damage caused by Trump’s wrenching ..
Donald Trump is the dumbest, worst presidential candidate your country …
I am so disappointed in Colby Keller for supporting Trump.
Billy Bush’s career is merely collateral damage caused by Trump’s wrenching..
In swing states, Trump continues to struggle.
Trump wins
Co-host Jedediah Bila agreed, saying that the move makes Trump look desperate.
Trump wins
“Trump attacks Clinton for being bisexual!”
Pence also praised Trump for apologizing following the tape’s disclosure.
In swing states, Trump continues to struggle.
the reason Trump is so dangerous to the establishment is he is unapologetical..

Here are some public social media soundbites for Clinton in the same period:

Hillary deserves worse than jail.
Congratulations to Hillary & her campaign staff for wining three Presidential ..
I HATE @chicanochamberofcommerce FOR INTRODUCING THAT HILLARY …
As it turns out, Hillary creeped out a number of people with her grin.
Hillary trumped Trump
Trump won!  Hillary lost
Hillary violated the Special Access Program (SAP) for disclosing about the …
I trust Flint water more than Hillary
Hillary continued to baffle us with her bovine feces.
Supreme Court: Hillary is our only choice for keeping LGBT rights.
kudos to hillary for remaining sane, I’d have killed him by now
How is he blaming Hillary for sexually assaulting women. He’s such a shithead
The only reason I’m voting for Hillary is that Donald is the only other choice
Hillary creeps me out with that weird smirk.
Hillary is annoying asf with all of her laughing
I credit Hillary for the Cubs waking up
When you listen to Hillary talk it is really stupid
On the other hand, Hillary Clinton has a thorough knowledge by virtue of …
Americans deserve better than Hillary
Certain family members are also upset with me for speaking out against …
Hillary is hated by all her security detail for being so abusive
Hillary beat trump
The only reason to vote for Hillary is she’s a woman.
Certain family members are also upset with me for speaking out against ….
I am glad you seem to be against Hillary as well Joe Pepe.
Hillary scares me with her acions.
Unfortunately Wikileaks is the monster created by Hillary & democrats.
I’m just glad you’re down with evil Hillary.
Hillary was not mad at Bill for what he did.  She was mad he got caught.  ……
These stories are falling apart like Hillary on 9/11
Iam so glad he is finally admitting this about Hillary Clinton.
Why hate a man for doing nothing like Hillary Clinton
Hillary molested me with a cigar while Bill watched.
You are upset with Hillary for doing the same as all her predecessors.
I feel like Hillary Clinton is God’s punishment on America for its sins.
Trumps beats Hillary
You seem so proud of Hillary for laughing at rape victims.
Of course Putin is going to hate Hillary for publicly announcing false …
Russia is pissed off at Hillary for blaming the for wikileaks!
Hillary will not win.  Good faith is stronger than evil.  Trump wins??
I am proud of Hillary for standing up for what is good in the USA.
Hillarys plans are worse than Obama
Hillary is the nightmare “the people” have created.
Funny how the Hillary supporters are trashing Trump for saying the same …
???????????? I am so proud of the USA for making Hillary Clinton president.
Hillary, you’re a hoax created by the Chinese
Trump trumps Hillary
During the debate, Trump praised Hillary for having the will to fight.
Trump is better person than Hillary
Donald TRUMPED Hillary
Kudos to Hillary for her accomplishments.
He also praised Hillary for handling the situation with dignity.
During the debate, Trump praised Hillary for having the will to fight.
People like Hillary in senate is the reason this country is going downhill.
Hillary did worse than expectations.
Trump will prosecute Hillary for her crimes, TRUMP will!
Have to praise Hillary for keeping her focus.
a landslide victory for Hillary will restore confidence in American democracy ..
I was so proud of Hillary tonight for acting like a tough, independent woman.
I dislike Hillary Clinton, as I think she is a corrupt, corporate shill.
Hillary did worse than Timmy Kaine
Im so glad he finally brought Benghazi against Hillary
Hillary, thank you for confirmation that the Wikileaks documents are authentic
Supreme Court justices is the only reason why I’d vote for Hillary.
Massive kudos to Hillary for keeping her cool with that beast behind her.
Congrats to Hillary for actually answering the questions. She’s spot on. #debate



Social media mining: Did Trump’s Gettysburg speech enable the support rate to soar as claimed?

Big data mining shows clear social rating decline of Trump last month

Clinton, 5 years ago. How time flies …

Automated Suevey

From IBM’s Jeopardy robot, Apple’s Siri, to the new Google Translate

Latest Headline News: Samsung acquires Viv, a next-gen AI assistant built by the creators of Apple’s Siri.

Some people are just smart, or shrewd, more than we can imagine.  I am talking about Fathers of Siri, who have been so successful with their technology that they managed to sell the same type of technology twice, both at astronomical prices, and both to the giants in the mobile and IT industry.  What is more amazing is, the companies they sold their tech-assets to are direct competitors.  How did that happen?  How “nice” this world is, to a really really smart technologist with sharp business in mind.

What is more stunning is the fact that, Siri and the like so far are regarded more as toys than must-carry tools, intended at least for now to satisfy more curiosity than to meet the rigid demand of the market.  The most surprising is that the technology behind Siri is not unreachable rocket science by nature,  similar technology and a similar level of performance are starting to surface from numerous teams or companies, big or small.

I am a tech guy myself, loving gadgets, always watching for new technology breakthrough.  To my mind, something in the world is sheer amazing, taking us in awe, for example, the wonder of smartphones when the iPhone first came out. But some other things in the tech world do not make us admire or wonder that much, although they may have left a deep footprint in history. For example, the question answering machine made by IBM Watson Lab in winning Jeopardy.  They made it into the computer history exhibition as a major AI milestone.  More recently, the iPhone Siri, which Apple managed to put into hands of millions of people first time for seemingly live man-machine interaction. Beyond that accomplishment, there is no magic or miracle that surprises me.  I have the feel of “seeing through” these tools, both the IBM answering robot type depending on big data and Apple’s intelligent agent Siri depending on domain apps (plus a flavor of AI chatbot tricks).

Chek: @ Wei I bet the experts in rocket technology will not be impressed that much by SpaceX either,

Wei: Right, this is because we are in the same field, what appears magical to the outside world can hardly win an insider’s heart, who might think that given a chance, they could do the same trick or better.

The Watson answering system can well be regarded as a milestone in engineering for massive, parallel big data processing, not striking us as an AI breakthrough. what shines in terms of engineering accomplishment is that all this happened before the big data age when all the infrastructures for indexing, storing and retrieving big data in the cloud are widely adopted.  In this regard, IBM is indeed the first to run ahead of the trend, with the ability to put a farm of servers in working for the QA engine to be deployed onto massive data.  But from true AI perspective, neither the Watson robot nor the Siri assistant can be compared with the more-recent launch of the new Google Translate based on neural networks.  So far I have tested using this monster to help translate three Chinese blogs of mine (including this one in making), I have to say that I have been thrown away by what I see.  As a seasoned NLP practitioner who started MT training 30 years ago, I am still in disbelief before this wonder of the technology showcase.

Chen: wow, how so?

Wei:  What can I say?  It has exceeded my imagination limit for all my dreams of what MT can be and should be since I entered this field many years ago.  While testing, I only needed to do limited post-editing to make the following Chinese blogs of mine presentable and readable in English, a language with no kinship whatsoever with the source language Chinese.

Question answering of the past and present

Introduction to NLP Architecture

Hong: Wei seemed frightened by his own shadow.Chen:

Chen:  The effect is that impressive?

Wei:  Yes. Before the deep neural-nerve age, I also tested and tried to use SMT for the same job, having tried both Google Translate and Baidu MT, there is just no comparison with this new launch based on technology breakthrough.  If you hit their sweet spot, if your data to translate are close to the data they have trained the system on, Google Translate can save you at least 80% of the manual work.  80% of the time, it comes so smooth that there is hardly a need for post-editing.  There are errors or crazy things going on less than 20% of the translated crap, but who cares?  I can focus on that part and get my work done way more efficiently than before.  The most important thing is, SMT before deep learning rendered a text hardly readable no matter how good a temper I have.  It was unbearable to work with.  Now with this breakthrough in training the model based on sentence instead of words and phrase, the translation magically sounds fairly fluent now.

It is said that they are good a news genre, IT and technology articles, which they have abundant training data.  The legal domain is said to be good too.  Other domains, spoken language, online chats, literary works, etc., remain a challenge to them as there does not seem to have sufficient data available yet.

Chen: Yes, it all depends on how large and good the bilingual corpora are.

Wei:  That is true.  SMT stands on the shoulder of thousands of professional translators and their works.  An ordinary individual’s head simply has no way in  digesting this much linguistic and translation knowledge to compete with a machine in efficiency and consistency, eventually in quality as well.

Chen: Google’s major contribution is to explore and exploit the existence of huge human knowledge, including search, anchor text is the core.

Ma: I very much admire IBM’s Watson, and I would not dare to think it possible to make such an answering robot back in 2007.

Wei: But the underlying algorithm does not strike as a breakthrough. They were lucky in targeting the mass media Jeopardy TV show to hit the world.  The Jeopardy quiz is, in essence, to push human brain’s memory to its extreme, it is largely a memorization test, not a true intelligence test by nature.  For memorization, a human has no way in competing with a machine, not even close.  The vast majority of quiz questions are so-called factoid questions in the QA area, asking about things like who did what when and where, a very tractable task.  Factoid QA depends mainly on Named Entity technology which was mature long ago, coupled with the tractable task of question parsing for identifying its asking point, and the backend support from IR, a well studied and practised area for over 2 decades now.  Another benefit in this task is that most knowledge questions asked in the test involve standard answers with huge redundancy in the text archive expressed in various ways of expressions, some of which are bound to correspond to the way question is asked closely.  All these factors contribute to IBM’s huge success in its almost mesmerizing performance in the historical event.  The bottom line is, shortly after the 1999 open domain QA was officially born with the first TREC QA track, the technology from the core engine has been researched well and verified for factoid questions given a large corpus as a knowledge source. The rest is just how to operate such a project in a big engineering platform and how to fine-tune it to adapt to the Jeopardy-style scenario for best effects in the competition.  Really no magic whatsoever.

Google Translated from【泥沙龙笔记:从三星购买Siri之父的二次创业技术谈起】, with post-editing by the author himself.



Question answering of the past and present

Introduction to NLP Architecture

Newest GNMT: time to witness the miracle of Google Translate

Dr Li’s NLP Blog in English


Newest GNMT: time to witness the miracle of Google Translate


Recently, the microblogging (wechat) community is full of hot discussions and testing on the newest annoucement of the Google Translate breakthrough in its NMT (neural network-based machine translation) offering, claimed to have achieved significant progress in data quality and readability.  Sounds like a major breakthrough worthy of attention and celebration.

The report says:

Ten years ago, we released Google Translate, the core algorithm behind this service is PBMT: Phrase-Based Machine Translation.  Since then, the rapid development of machine intelligence has given us a great boost in speech recognition and image recognition, but improving machine translation is still a difficult task.

Today, we announced the release of the Google Neural Machine Translation (GNMT) system, which utilizes state-of-the-art training techniques to maximize the quality of machine translation so far. For a full review of our findings, please see our paper “Google`s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation.”A few years ago, we began using RNN (Recurrent Neural Networks) to directly learn the mapping of an input sequence (such as a sentence in a language) to an output sequence (the same sentence in another language). The phrase-based machine learning (PBMT) breaks the input sentences into words and phrases, and then largely interprets them independently, while NMT interprets the entire sentence of the input as the basic unit of translation .

A few years ago, we began using RNN (Recurrent Neural Networks) to directly learn the mapping of an input sequence (such as a sentence in a language) to an output sequence (the same sentence in another language).  The phrase-based machine learning (PBMT) breaks the input sentences into words and phrases, and then largely interprets them independently, while NMT interprets the entire sentence of the input as the basic unit of translation .The advantage of this approach is that compared to the previous phrase-based translation system, this method requires less engineering design. When it was first proposed, the accuracy of the NMT on a medium-sized public benchmark

The advantage of this approach is that compared to the previous phrase-based translation system, this method requires less engineering design. When it was first proposed, the accuracy of the NMT on a medium-sized public benchmark data set was comparable to that of a phrase-based translation system.  Since then, researchers have proposed a number of techniques to improve NMT, including modeling external alignment models to handle rare words, using attention to align input and output words, and word decomposition into smaller units to cope with rare words. Despite these advances, the speed and accuracy of NMT has not been able to meet the requirements of a production system such as Google Translate.  Our new paper describes how to overcome many of the challenges of making NMT work on very large data sets and how to build a system that is both fast and accurate enough to deliver a better translation experience for Google users and services.


Using side-by-side comparisons of human assessments as a standard, the GNMT system translates significantly better than the previous phrase-based production system.  With the help of bilingual human assessors, we found in sample sentences from Wikipedia and the news website that GNMT reduced translational errors by 55% to 85% or more in the translation of multiple major pairs of languages.

In addition to publishing this research paper today, we have also announced that GNMT will be put into production in a very difficult language pair (Chinese-English) translation.

Now, the Chinese-English translations of the Google Translate for mobile and web versions have been translated at 100% using the GNMT machine – about 18 million translations per day.  GNMT’s production deployment uses our open machine learning tool suite TensorFlow and our Tensor Processing Units (TPUs), which provide sufficient computational power to deploy these powerful GNMT models, meeting Google Translate strict latency requirements for products.

Chinese-to-English translation is one of the more than 10,000 language pairs supported by Google Translate. In the coming months, we will continue to extend our GNMT to far more language pairs.

GNMT translated from Google Translate achieves a major breakthrough!

As an old machine translation researcher, this temptation cannot be resisted.  I cannot wait to try this latest version of the Google Translate for Chinese-English.
Previously I tried Google Chinese-to-English online translation multiple times, the overall quality was not very readable and certainly not as good as its competitor Baidu.  With this newest breakthrough using deep learning with neural networks, it is believed to get close to human translation quality.  I have a few hundreds of Chinese blogs on NLP, waiting to be translated as a try.  I was looking forward to this first attempt in using Google Translate for my Science Popularization blog titled Introduction to NLP Architecture.  My adventure is about to start.  Now is the time to witness the miracle, if miracle does exist.

I hope you will not be disappointed.  I have jokingly said before: the rule-based machine translation is a fool, the statistical machine translation is a madman, and now I continue to ridicule: neural machine translation is a “liar” (I am not referring to the developers behind NMT).  Language is not a cat face or the like, just the surface fluency does not work, the content should be faithful to the original!

Let us experience the magic, please listen to this translated piece of my blog:

This is my Introduction to NLP Architecture fully automatically translated by Google Translate yesterday (10/2/2016) and fully automatically read out without any human interference.  I have to say, this is way beyond my initial expectation and belief.

Listen to it for yourself, the automatic speech generation of this science blog of mine is amazingly clear and understandable. If you are an NLP student, you can take it as a lecture note from a seasoned NLP practitioner (definitely clearer than if I were giving this lecture myself, with my strong accent). The original blog was in Chinese and I used the newest Google Translate claimed to be based on deep learning using sentence-based translation as well as character-based techniques.

Prof. Dong, you know my background and my original doubtful mindset. However, in the face of such a progress, far beyond our original imagination limits for automatic translation in terms of both quality and robustness when I started my NLP career in MT training 30 years ago, I have to say that it is a dream come true in every sense of it.

In their terminology, it is “less adequate, but more fluent.” Machine translation has gone through three paradigm shifts. When people find that it can only be a good information processing tool, and cannot really replace the human translation, they would choose the less costly.

In any case, this small test is revealing to me. I am still feeling overwhelmed to see such a miracle live. Of course, what I have just tested is the formal style, on a computer and NLP topic, it certainly hit its sweet spot with adequate training corpus coverage. But compared with the pre-NN time when I used both Google SMT and Baidu SMT to help with my translation, this breakthrough is amazing. As a senior old school practitioner of rule-based systems, I would like to pay deep tribute to our “nerve-network” colleagues. These are a group of extremely genius crazy guys. I would like to quote Jobs’ famous quotation here:

“Here’s to the crazy ones. The misfits. The rebels. The troublemakers. The round pegs in the square holes. The ones who see things differently. They’re not fond of rules. And they have no respect for the status quo. You can quote them, disagree with them, glorify or vilify them. About the only thing you can’t do is ignore them. Because they change things. They push the human race forward. And while some may see them as the crazy ones, we see genius. Because the people who are crazy enough to think they can change the world, are the ones who do.”

@Mao, this counts as my most recent feedback to the Google scientists and their work. Last time, about a couple of months ago when they released their parser, proudly claimed to be “the most accurate parser in the world”, I wrote a blog to ridicule them after performing a serious, apples-to-apples comparison with our own parser. This time, they used the same underlying technology to announce this new MT breakthrough with similar pride, I am happily expressing my deep admiration for their wonderful work. This contrast of my attitudes looks a bit weird, but it actually is all based on facts of life. In the case of parsing, this school suffers from lacking naturally labeled data which they would make use of in perfecting the quality, especially when it has to port to new domains or genres beyond the news corpora. After all, what exists in the language sea involves corpora of raw text with linear strings of words, while the corresponding parse trees are only occasional, artificial objects made by linguists in a limited scope by nature (e.g. PennTree, or other news-genre parse trees by the Google annotation team). But MT is different, it is a unique NLP area with almost endless, high-quality, naturally-occurring “labeled” data in the form of human translation, which has never stopped since ages ago.

Mao: @wei That is to say, you now embrace or endorse a neuron-based MT, a change from your previous views?

Yes I do embrace and endorse the practice. But I have not really changed my general view wrt the pros and cons between the two schools in AI and NLP. They are complementary and, in the long run, some way of combining the two will promise a world better than either one alone.

Mao: What is your real point?

Despite biases we are all born with more or less by human nature, conditioned by what we have done and where we come from in terms of technical background, we all need to observe and respect the basic facts. Just listen to the audio of their GSMT translation by clicking the link above, the fluency and even faithfulness to my original text has in fact out-performed an ordinary human translator, in my best judgment. If an interpreter does not have sufficient knowledge of my domain, if I give this lecture in a classroom, and ask an average interpreter to translate on the spot for me, I bet he will have a hard time performing better than the Google machine listed above (of course, human translation gurus are an exception). This miracle-like fact has to be observed and acknowledged. On the other hand, as I said before, no matter how deep the learning reaches, I still do not see how they can catch up with the quality of my deep parsing in the next few years when they have no way of magically having access to a huge labeled data of trees they depend on, especially in the variety of different domains and genres. They simply cannot “make bricks without straw” (as an old Chinese saying goes, even the most capable housewife can hardly cook a good meal without rice). Because in the natural world, there are no syntactic trees and structures for them to learn from, there are only linear sentences. The deep learning breakthrough seen so far is still mainly supervised learning, which has almost an insatiable appetite for massive labeled data, forming its limiting knowledge bottleneck.

Mao: I’m confused. Which one do you believe stronger? Who is the world’s No. 0?

Parsing-wise, I am happy to stay as No. 0 if Google insists on their being No. 1 in the world. As for MT, it is hard to say, from what I see, between their breakthrough and some highly sophisticated rule-based MT systems out there. But what I can say is, at a high level, the trends of the mainstream statistical MT winning the space both in the industry as well as in academia over the old school rule-based MT are more evident today than before.  This is not to say that the MT rule system is no longer viable, or going to an end. There are things which SMT cannot beat rule MT. For examples, certain types of seemingly stupid mistakes made by GNMT (quite some laughable examples of totally wrong or opposite translation have been illustrated in this salon in the last few days) are almost never seen in rule-based MT systems.

here is my try of GNMT from Chinese to English:


Learning, the second of a watershed, the number of subjects significantly significantly, learning methods have also changed, some students can adjust to adapt to changes in progress, progress quickly, from the middle to rise to outstanding. But there are some students there is Fear of hard feelings, the mind used in the study, the rapid decline in performance, loss of interest in learning, self-abandonment, since the devastated, so the students often difficult to break through the third day,

Mao: This translation cannot be said to be good at all.

Right, that is why it calls for an objective comparison to answer your previous question. Currently, as I see, the data for the social media and casual text are certainly not enough, hence the translation quality of online messages is still not their forte.  As for the previous textual sample Prof. Dong showed us above, Mao said the Google translation is not of good quality as expected. But even so, I still see impressive progress made there. Before the deep learning time, the SMT results from Chinese to English is hardly readable, and now it can generally be read loud to be roughly understood. There is a lot of progress worth noting here.

In the fields with big data, in recent years, DL methods are by leaps and bounds. I know a number of experts who used to be biased against DL have changed their views when seeing the results. However, DL in the IR field is still basically not effective so far, but there are signs of slowly penetrating IR.

The key to NMT is “looking nice”. So for people who do not understand the original source text, it sounds like a smooth translation. But isn’t it a “liar” if a translation is losing its faithfulness to the original? This is the Achille’s heel of NMT.

Ma: @Dong, I think all statistical methods have this aching point.

Indeed, there are respective pros and cons. Today I have listened to the Google translation of my blog three times and am still amazed at what they have achieved. There are always some mistakes I can pick here and there. But to err is human, not to say a machine, right? Not to say the community will not stop advancing and trying to correct mistakes. From the intelligibility and fluency perspectives, I have been served super satisfactorily today. And this occurs between two languages without historical kinship whatsoever.

Some leading managers said to me years ago, “In fact, even if machine translation is only 50 percent correct, it does not matter. The problem is that it cannot tell me which half it cannot translate well. If it can, I can always save half the labor, and hire a human translator to only translate the other half.” I replied that I am not able to make a system do that. Since then I have been concerned about this issue, until today when there is a lot of noise of MT replacing the human translation anytime from now. It’s kinda like having McDonald’s then you say you do not need a fine restaurant for French delicacy. Not to mention machine translation today still cannot be compared to McDonald’s. Computers, with machine translation and the like, are in essence a toy given by God for us human to play with. God never agrees to permit us to be equipped with the ability to copy ourselves.

Why GNMT first chose language pairs like Chinese-to-English, not the other way round to showcase? This is very shrewd of them. Even if the translation is wrong or missing the points, the translation is usually fluent at least in this new model, unlike the traditional model who looks and sounds broken, silly and erroneous. This is the characteristics of NMT, it is selecting the greatest similarity in translation corpus. As a vast number of English readers do not understand Chinese, it is easy to impress them how great the new MT is, even for a difficult language pair.

Correct. A closer look reveals that this “breakthrough” lies more on fluency of the target language than the faithfulness to the source language, achieving readability at cost of accuracy. But this is just a beginning of a major shift. I can fully understand the GNMT people’s joy and pride in front of a breakthrough like this. In our career, we do not always have that type of moment for celebration.

Deep parsing is the NLP’s crown. Yet to see how they can beat us in handling domains and genres lacking labeled data. I wish them good luck and the day they prove they make better parsers than mine would be the day of my retirement. It does not look anything like this day is drawing near, to my mind. I wish I were wrong, so I can travel the world worry-free, knowing that my dream has been better realized by my colleagues.

Thanks to Google Translate at https://translate.google.com/ for helping to translate this Chinese blog into English, which was post-edited by myself. 



Wei’s Introduction to NLP Architecture Translated by Google


NLP White Paper: Overview of Our NLP Core Engine

Introduction to NLP Architecture

It is untrue that Google SyntaxNet is the “world’s most accurate parser”

Announcing SyntaxNet: The World’s Most Accurate Parser Goes Open

Is Google SyntaxNet Really the World’s Most Accurate Parser?

Dr Li’s NLP Blog in English

Introduction to NLP Architecture

(translated by Google Translate, post-edited by myself)

For the natural language processing (NLP) and its applications, the system architecture is the core issue.  In my blog (  OVERVIEW OF NATURAL LANGUAGE PROCESSING), I sketched four NLP system architecture diagrams, now to be presented one by one .

In my design philosophy, an NLP process is divided into four stages, from the core engine up to the applications, as reflected in the four diagrams.  At the bottom is deep parsing, following the bottom-up processing of an automatic sentence analyzer.  This work is the most difficult, but it is the foundation and enabling technology for vast majority of NLP systems.


The purpose of parsing is to structure unstructured text.  Facing the ever-changing language, only when it is structured in some logical form can we formulate patterns for the information we like to extract to support applications.  This principle of linguistics structures began to be the consensus in the linguistics community when Chomsky proposed the transformation from surface structure to deep structure in his linguistic revolution of 1957.  A tree representing the logical form does not only involve arcs that express syntactic-semantic relationships, but also contain the nodes of words or phrases that carry various conceptual information.  Despite the importance of such deep trees, generally they do not directly support an NLP product.  They remain only the internal representation of the parsing system, as a result of language analysis and understanding before its semantic grouding to the applications as their core support.


The next layer after parsing is the extraction layer, as shown in the above diagram.  Its input is the parse tree, and the output is the filled-in content of templates, similar to filling in a form: that is the information needed for the application, a pre-defined table (so to speak), so that the extraction system can fill in the blanks by the related words or phrases extracted from text based on parsing. This layer has gone from the original domain-independent parser into the application-oriented and product-demanded tasks.

It is worth emphasizing that the extraction layer is geared towards the domain-oriented semantic focus, while the previous parsing layer is domain-independent.  Therefore, a good framework is to do a very thorough analysis of logic semantics in deep parsing, in order to reduce the burden of information extraction.  With the depth of the analysis in  the logical semantic structures to support the extraction, a rule at extraction layer is in essence equivalent to thousands of surface rules at linear text layer.  This creates the conditions for the efficient porting to new domains based on the same core engine of parsing.

There are two types of extraction, one is the traditional information extraction (IE), the extraction of facts or objective information: named entities, the relationships between entities, and events involving entities (which can answer questions like “who did what when and where” and the like).  This extraction of objective information is the core technology and foundation for the knowledge graph (nowadays such a hot area in industry).  After completion of IE, the next layer of information fusion (IF) is aimed at constructing the knowledge graph.   The other type of extraction is about subjective information, for example, the public opinion mining is based on this kind of extraction. What I have done over the past five years as my focus is along this line for fine-grained extraction of public opinions (not just sentiment classification, but also to explore the reasons behind the public opinions and sentiments to provide the insights basis for decision-making).  This is one of the hardest tasks in NLP, much more difficult than IE for objective information.  Extracted information is usually stored in a database. This provides huge textual mentions of information to feed the underlying mining layer.

Many people confuse information extraction and text mining, but, in fact, they are two levels of different tasks.  Extraction faces each individual language tree, embodied in each sentence, in order to find the information we want.  The mining, however, faces a corpus, or data sources as a whole, from the language forest for gathering statistically significant insights.  In the information age, the biggest challenge we face is information overload, we have no way to exhaust the information ocean for the insights we need, therefore, we must use the computer to dig out the information from the ocean for the required critical intelligence to support different applications. Therefore, mining relies on natural statistics, without statistics, the information is still scattered across the corpus even if it is identified.  There is a lot of redundancy in the extracted mentions of information, mining can integrate them into valuable insights.


Many NLP systems do not perform deep mining, instead, they simply use a query to search real-time from the extracted information index in the database and merge the retrieved information on-the-fly, presenting the top n results to the user. This is actually also mining, but it is a way of retrieval to achieve simple mining for directly supporting an application.

In order to do a good job of mining, there is a lot of work that can be done in this mining layer. Text mining not only improves the quality of existing extracted information pieces, moreover, it can also tap the hidden information, that is not explicitly expressed in the data sources, such as the causal relationship between events, or statistical trends of the public opinions or behaviours. This type of mining was first done in the traditional data mining applications as the traditional mining was aimed at structured data such as transaction records, making it easy to mine implicit associations (e.g., people who buy diapers often buy beer, this reflects the common behaviours of young fathers of the new-born, and such hidden association can be mined to optimize the layout and sales of goods). Nowadays, natural language is also structured thanks to deep parsing, hence data mining algorithms for hidden intelligence in the database can, in principle, also be applied to enhance the value of intelligence.

The fourth architectural diagram is the NLP application layer. In this layer, the results from parsing, extraction, and mining out of the unstructured text sources can be used to support a variety of NLP products and services, ranging from the QA (question answering) systems to the dynamic construction of the knowledge graph (this type of graph is visualized now in the Google search when we do a search for a star or VIP), from automatic polling of public opinions to customer intelligence about brands, from intelligent assistants (e.g. chatbots, Siri etc.) to automatic summarization and so on.


This is my overall presentation of the basic architecture of NLP and its applications, based on nearly 20 years of experiences in the industry to design and develop NLP products.  About 18 years ago, I was presenting a similar diagram of the NLP architecture to the first venture investor who told us that this is a million dollar slide.  The presentation here is a natural inheritance and extension from that diagram.

Here is the previously mentioned million-dollar slide story.  Under the Clinton’s administration before the turn of the century, the United States went through a “great leap forward” of the Internet technology, known as Dot Com Bubble, a time of hot money pouring into the IT industry while all kinds of Internet startups were springing up.  In such a situation, my boss decided to seek venture capital for the business expansion, and requested me to illustrate our prototype of the implemented natural language system for its introduction.  I then drew the following three-tier structure of an NLP system diagram: the bottom layer is parsing, from shallow to deep, the middle is built on parsing for information extraction, and the top layer illustrates some major categories of NLP applications, including QA.  Connecting applications and the downstairs two layers of language processing is the database, used to store the results of information extraction, ready to be applied at any time to support upstairs applications.  This general architecture has not changed much since I made it years ago, although the details and layout have been redrawn no less than 100 times.  The architecture diagram below is about one of the first 20 editions, involving mainly the backend core engine of information extraction architecture, not so much on the front-end flowchart for the interface between applications and the database.  I still remember early in the morning, my boss sent the slide to a Wall Street angel investor, by noon we got his reply, saying that he was very interested.  Less than two weeks, we got the first million dollar angel investment check.  Investors label it as a million dollar slide, which is believed to have not only shown the depth of language technology but also shows the great potential for practical applications.


Pre-Knowledge Graph: Architecture of Information Extraction Engine


【Related Chinese Blogs】

NLP Overview

Pre-Knowledge Graph: The Architecture of Information Extraction Engine

Natural language parser is to reveal the mystery of the language like a LIGO-type detector

Dream come true

( translated from http://blog.sciencenet.cn/blog-362400-981742.html )

The speech generation of the fully automatically translated, un-edited science blog of mine is attached below (for your entertainment :=), it is amazingly clear and understandable (definitely clearer than if I were giving this lecture myself with my strong accent).  If you are an NLP student, you can listen to it as a lecture note from a seasoned NLP practitioner.

Thanks to the newest Google Translate service from Chinese into English at https://translate.google.com/ 




Wei’s Introduction to NLP Architecture Translated by Google


NLP White Paper: Overview of Our NLP Core Engine

Not an ad. But a historical record.

Although not updated for long, this wiki remains like this until today 9/28/2016
from https://en.wikipedia.org/wiki/NetBase_Solutions,_Inc.


NetBase Solutions, Inc.

From Wikipedia, the free encyclopedia
  (Redirected from NetBase)
NetBase Solutions, Inc.
Industry Market Research
Founded 2004
Founder Jonathan Spier and Michael Osofsky
Headquarters Mountain View, CA, USA
Area served
Key people
Peter Caswell, CEO
Mark Bowles, CTO
Lisa Joy Rosner, CMO
Dr. Wei Li, Chief Scientist
Products NetBase Insight Workbench
Website www.netbase.com

NetBase Solutions, Inc. is a Mountain View, CA based developer of natural language processing technology used to analyze social media and other web content. It was founded by two engineers from Ariba in 2004 as Accelovation, before changing names to NetBase in 2008. It has raised a total of $21 million in funding. It’s sold primarily on a subscription basis to large companies to conduct market research and social media marketing analytics. NetBase has been used to evaluate the top reasons men wear stubble, the products Kraft should develop and the favorite tech company based on digital conversations.


NetBase was founded by Jonathan Spier and Michael Osofsky, both of whom were engineers at Ariba, in 2004 as Accelovation, based on the combination of the words “acceleration” and “innovation.”[1][2] It raised $3 million in funding in 2005, followed by another $4 million in 2007.[1][3] The company changed its name to NetBase in February 2008.[4][5]

It developed its analytics tools in March 2010 and began publishing monthly brand passion indexes (BPI) comparing brands in a market segment using the tool shortly afterwards.[6] In 2010 it raised $9 million in additional funding and another $2.5 million in debt financing.[1][3] NetBase Insight Workbench was released in March 2011 and a partnership was formed with SAP AG that December for SAP to resell NetBase’s software.[7] In April 2011, a new CEO Peter Caswell was appointed.[8] Former TIBCO co-inventor, patent author and CTO Mark Bowles is now the CTO at NetBase and held responsible for many technical achievements in scalability.[9]

Software and services

Screenshot of NetBase Insight Workbench dashboard

NetBase sells a tool called NetBase Insight Workbench that gives market researchers and social marketers a set of analytics, charts and research tools on a subscription basis. ConsumerBase is what the company calls the back-end that collects and analyzes the data. NetBase targets market research firms and social media marketing departments, primarily at large enterprises with a price-point of around $100,000.[10][11] NetBase is also white-labeled by Reed Elsevier in a product called illumin8.[12]


For the average NetBase user, 12 months of activity is twenty billion sound bytes from just over seven billion digital documents. The company claims to index 50,000 sentences a minute from sources like public-facing Facebook, blogs, forums, Twitter and consumer review sites.[13][14]

According to a story in InformationWeek, Kraft uses NetBase to measure customer needs and conduct market research for new product ideas.[15] In 2011 the company released a report based on 18 billion postings over twelve months on the most loved tech companies. Salesforce.com, Cisco Systems and Netflix were among the top three.[16] Also in 2011, NetBase found that the news of Osama Bin Laden eclipsed the royal wedding and the Japan earthquake in online activity.[17]

External links


  1. ^ Jump up to:a b c By Matt Marshall, VentureBeat. “Accelovation Raises $4M for online software for IT market research.” December 3, 2007.
  2. Jump up^ BusinessWeek profile
  3. ^ Jump up to:a b By Jon Xavier, BizJournals. “NetBase filters social media for what clients need to know.” June 3, 2011.
  4. Jump up^ By Barbara Quint, Information Today. “Elsevier and NetBase Launch illumin8.” February 28, 2008.
  5. Jump up^ The Economist. “Improving Innovation.” February 29, 2008.
  6. Jump up^ By Rachael King, BusinessWeek. “Most Loved — And Hated — Tech Companies.”
  7. Jump up^ Darrow, Barb (December 12, 2011). “SAP taps NetBase for deep social media analytics”. GigaOm. Retrieved May 8, 2012.
  8. Jump up^ San Jose Mercury News. “People on the Move.” May 15, 2011.
  9. Jump up^ By David F. Carr, InformationWeek. “How Much is your Brand Loved (or Hated)?” June 16, 2011.
  10. Jump up^ By Eric Schoenfeld, TechCrunch. “NetBase Offers Powerful Semantic Indexing Platform That Reads The Web.” April 22, 2009.
  11. Jump up^ By Jon Xavier, BizJournals. “NetBase filters social media for what clients need to know.” June 3, 2011.
  12. Jump up^ By Barbara Quint, Newsbreak. “Elsevier and NetBase Launch illumin8.” February 28, 2008.
  13. Jump up^ By Neil Glassman, Social Times. “What Every Social Media Marketer Should Know About NetBase.” August 24, 2010.
  14. Jump up^ By Ryan Flinn, BusinessWeek. “Wanted: Social Media Sifters.” October 21, 2010.
  15. Jump up^ By David F. Carr, InformationWeek. “How Kraft Foods Listens to Social Media.” June 30, 2011.
  16. Jump up^ By Ryan Flinn, Bloomberg. “Tech companies measure online sentiment.” May 19, 2011.
  17. Jump up^ By Geoffrey Fowler and Alexandra Berzon, Wall Street Journal. “Social Media Buzzes, Comes Into Its Own.” May 2, 2011.

Who we are. Not an ad, but a snapshot.





We are uniquely positioned to help global businesses create real business value from the unprecedented level of growth opportunities presented each day by social media. We have the industry’s fastest and most accurate social analytics platform, strong partnerships with companies like Twitter, DataSift, and Tumblr, and award-winning patented language technology.

We empower brands and agencies to make the smartest business decisions grounded on the deepest and most reliable consumer insights from social. We’ve grown 300 percent year-over-year and excited to see revenue grow by 4,000% since the second quarter of 2012.


We were recently named a top rated social media management platform by software users on TrustRadius and a market leader by G2 Crowd.


“NetBase is one of the strongest global social listening and analytics tools in the market. Their new interface makes customized dashboard creation a breeze.”

– Omri Duek, Coca-Cola

“Data reporting is both broad and detailed, with the ability to drill down from annual data to hourly data. NetBase allows us to have a pulse on the marketplace in just a few minutes.”

– Susie Thomas, VP, Palisades Media Group

“We started with a gen one solution, but then found that we needed to move to a tool with a better accuracy that could support digital strategy and insights research. NetBase satisfied all our needs.”

– Jared Degnan, Director of Digital Strategy

“As one of the first brands to test NetBase Audience 3D for our Mobile App launch, we’ve found that we could engage with our consumers on a deeper, more human level that further drives them to be brand champions.”

– Mihir Minawala, Manager of Social, Industry & Competitive Intelligence, Taco Bell


We work with executives from forward-looking agencies and leading brands across all verticals in over 99 countries. Our customers use NetBase for real-time consumer insights across the organization, from brand and digital marketing, public relations, product management to customer care.


  • March 2003
    Founded by Michael Osofsky at MIT. Later joined by Wei Li, Chief NetBase Scientist
  • July 2009
    P&G, Coca-Cola and Kraft signed as first customers of NetBase
  • January 2014
    Named Best-in-Class By Consumer Goods Technology
  • April 2014
    Launched Brand Live Pulse, the first real-time view of brands’ social movements
  • May 2014
    Celebrated 10 years with 500% customer growth in 3 years
  • January 2015
    AdAge Names 5 NetBase Customers to the Agency A-List
  • March 2015
    Introduced Audience 3D, the first ever 3D view of audiences
  • April 2015
    Raised $33 MM in Series E Round
  • November 2015
    Named Market Leader by G2 Crowd. Earned Top Ratings by Trust Radius


What inspired you to join NetBase?

It was exciting to build the technology that could quickly surface meaningful customer insights at scale. For example, what used to take a day to run a simple analysis now takes just a second. Our platform now analyzes data in “Google time”, yet the depth and breadth of our analysis is exponentially deeper and larger than what you’ll ever get from a Google search.

What are you most proud of at NetBase?

I’m especially proud that we have the industry’s most accurate, deepest, fastest, and more granular text analysis technology. This enables us to gives our customers very actionable insights, unlike other platforms that offer broad sentiment analysis and general trending topics. Plus, NetBase reads 42 languages. Other platforms don’t even come close. We are customer-centric. Our platform truly helps customers quickly identify their priorities and next steps. This is what sets us apart.

What is the next frontier for NetBase?

With the exploding growth of social and mobile data and new social networks emerging, we’ll be working on connecting all these data points to help our customers get even more out of social data. As Chief Scientist, I’m more excited than ever to develop a “recipe” that can work with the world’s languages and further expand our language offerings.


NetBase Solutions, Inc  © 2016

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP

Interaction of syntax and semantics in parsing Chinese transitive verb patterns

Interaction of syntax and semantics in parsing Chinese transitive verb patterns *
(old paper in Proceedings of International Chinese Computing Conference, ICCC’96)

Wei  LI

Department of Linguistics, Simon Fraser University
Burnaby, B.C. V5A 1S6 CANADA (email: lio@sfu.ca)

Keywords: Chinese processing, transitive pattern, syntax, semantics, lexical rule, HPSG


This paper addresses the problem of parsing Chinese transitive verb patterns (including the BA construction and the BEI construction) and handling the related phenomena of semantic deviation (i.e. the violation of the semantic constraint).

We designed a syntax-semantics combined model of Chinese grammar in the framework of Head-driven Phrase Structure Grammar [Pollard & Sag 1994]. Lexical rules are formulated to handle both the transitive patterns which allow for semantic deviation and the patterns which disallow it. The lexical rules ensure the effective interaction between the syntactic constraint and the semantic constraint in analysis.

The contribution of our research can be summarized as:

(1) the insight on the interaction of syntax and semantics in analysis;
(2) a proposed lexical rule approach to semantic deviation based on (1);
(3) the application of (2) to the study of the Chinese transitive patterns;
(4) the implementation of (3) in an unification-based Chinese HPSG prototype.

  1. Background

When Chomsky proposed his Syntactic Structures in Fifties, he seemed to indicate that syntax should be addressed independently of semantics. As a convincing example, he presented a famous sentence:

1)             Colorless green ideas sleep furiously.

Weird as it sounds, the grammaticality of this sentence is intuitively acknowledged: (1) it follows the English syntax; (2) it can be interpreted. In fact, there is only one possible interpretation, solely decided by its syntactic structure. In other words, without the semantic interference, our linguistic knowledge about the English syntax is sufficient to assign roles to each constituent to produce a reading although the reading does not seem to make sense.

However, things are not always this simple. Compare the following Chinese sentences of the same form NP NP V:

2a)           dianxin  wo           chi           le.
                Dim-Sum I               eat           LE.
The Dim Sum I have eaten.
Note:        LE is a particle for perfect aspect.

2b)   wo dianxin chi le.
I have eaten the Dim Sum.

Who eats what? There is no formal way but to resort to the semantic constraint imposed by the notion eat to reach the correct interpretation [Li, W. & McFetridge 1995].

Of course, if we want to maintain the purity of syntax, it could be argued that syntax will only render possible interpretations and not the interpretation.  It is up to other components (semantic filter and/or other filters) of grammar to decide which interpretation holds in a certain context or discourse. The power of syntax lies in the ability to identify structural ambiguities and to render possible corresponding interpretations. We call this type of linguistic design a syntax-before-semantics model. While this is one way to organize a  grammar, we found it unsatisfactory for two reasons. First, it does not seem to simulate the linguistic process of human comprehension closely.  For human listeners, there are no ambiguities involved in sentences 2a) and 2b). Secondly, there is considerable cost on processing efficiency in terms of computer implementation. This efficiency problem can be very serious in the analysis of languages like Chinese with virtually no inflection.

Head-driven Phrase Structure Grammar (HPSG) [Pollard & Sag 1994, 1987] assumes a lexicalist approach to linguistic analysis and advocates an integrated model of syntax and the other components of grammar. It serves as a desirable framework for the integration of the semantic constraint in establishing syntactic structures and interpretations. Therefore, we proposed to enforce the semantic constraint that animate being eats food directly in the lexical entry chi  (eat) [Li, W. & McFetridge 1995]: chi (eat) requires an animate NP subject and a food NP object. It correctly addresses who-eats-what problem for sentences like 2a) and 2b). In fact, this type of semantic constraint (selection restriction) has been widely used for disambiguation in NLP systems.

The problem is, the constraint should not always be enforced. In the practice of communication, deviation from the constraint is common and deviation is often deliberately applied to help render rhetorical expressions.


3) xiang      chi           yueliang,  ni             gou           de3    zhao       me?
    want        eat           moon,       you          reach       DE3  -able          ME?
Wanting to eat the moon, but can you reach it?
Note:  DE3 is a particle, introducing a postverbal adjunct of result or capability. ME is a sentence final particle for yes-no question.

4) dajia         dou   chi           shehui zhuyi,           neng         bu            qiong       me?
     people      all      eat           social -ism,               can            not           poor         ME
Everyone is eating socialism, can it not be poor?

yueliang (moon) is not food, of course. It is still some physical object, though. But in 4), shehui zhuyi (socialism) is a purely abstract notion. If a parser enforces the rigid semantic constraint, there are many such sentences that will be rejected without getting a chance to be interpreted. The fact is, we do have interpretations for 3) and 4). Hence an adequate grammar should be able to accommodate those interpretations.

To capture such deviation, Wilks came up with his Preference Semantics [Wilks 1975, 1978]. A sophisticated mechanism is designed to calculate the semantic weight for each possible interpretation, i.e. how much it deviates from the preference semantic constraint. The final choice will be given to the interpretation with the most semantic weight in total. His preference model simulates the process of how human comprehends language more closely than most previous approaches.

The problem with this design is the serious computational complexities involved in the model [Huang 1987]. In order to calculate the semantic weight, the preference semantic constraint is loosened step by step. Each possible substructure has to be re-tried with each step of loosening. It may well lead to combinatorial explosion.

What we are proposing here is to look at semantic deviation in the light of the interaction of the syntactic constraint and the semantic constraint. In concrete terms, the loosening of the semantic constraint is conditioned by syntactic patterns. Syntactic pattern is defined as the representation of an argument structure in surface form. A pattern consists of 2 parts: a structure’s syntactic constraint (in terms of the syntactic categories and configuration, word order,  function words and/or inflections) and its interpretation (role assignment). For example, for Chinese transitive structure, NP V NP: SVO is one pattern, NP NP V: SOV is another pattern, and NP [ba NP] V: SOV (the BA construction) is still another. The expressive power of a language is indicated by the variety of patterns used in that language. Our design will account for some semantic deviation or rhetorical phenomena seen in everyday Chinese without the overhead of computational complexities. We will focus on Chinese transitive verb patterns for illustration of this approach.

  1. Chinese transitive patterns

Assuming three notional signs wo (I), chi (eat) and dianxin (Dim Sum), there are maximally 6 possible combinations in surface word order, out of which 3 are grammatical in Chinese.[1]

5a)           wo chi le dianxin.                                   SVO
5b)           wo dianxin chi le.                                   SOV
5c)           dianxin wo chi le.                                    OSV

SVO is the canonical word order for Chinese transitive structure. When a string of signs matches the order NP V NP, the semantic constraint has to yield to syntax for interpretation.


6)  daodi         shi     ni             zai         du       shu          ne,
haishi                 shu           zai         du       ni             ne?

     on-earth     be     you          ZAI        read     book        NE,
or                        book        ZAI        read     you          NE?

Are you reading the book, or is the book reading you, anyway?
Note:        ZAI is a particle for continuous aspect.
NE is a sentence final particle for or-question.

Same as in the English equivalent, the interpretation of  6) can only be SVO, no matter how contradictory  it might be to our common sense. In other words, in the form of NP V NP, syntax plays a decisive role.

In contrast, to interpret the form NP NP V as SOV in 2b), the semantic constraint is critical. Without the enforcement of the semantic constraint, the interpretation of SOV does not  hold. In fact, this SOV pattern (NP1 NP2 V: SOV) has been regarded as ungrammatical in a Case Theory account for Chinese transitive structure in the framework of GB. According to their analysis, something similar to this pattern constitutes the D‑Structure for transitive pattern and Chinese is an underlying SOV language (called “SOV Hypothesis”: see the survey in Gao 1993). In the surface structure, NP2 is without case on the assumption that V assigns its CASE only to the right. One has to either insert the case-marker ba to assign CASE to it (the BA construction) or move it to the right of V to get its CASE (the SVO pattern). This analysis suffers from not being able to account for the grammaticality of sentences like 2b).  However, by distinguishing the deep pattern SOV from the 2 surface patterns (the SVO and the BA construction), the theory has its merit to alert us that the SOV pattern seems to be syntactically problematic (crippled, so to speak). This is an insightful point, but it goes one step too far in totally rejecting the SOV pattern in surface structure. If we modify this idea, we can claim that SOV is a syntactically unstable pattern and that SOV tends to (not must) “transform” to the SVO or the BA construction unless it is reinforced by semantic coherence (i.e. the enforcement of the semantic constraint). This argument in the light of syntax-semantics interaction is better supported by the Chinese data. In essence, our account is close to this reformulated argument, but in our theory, we do not assume a deep structure and transformation. All patterns are surface constructions. If no sentences can match a construction, it is not considered as a pattern by our definition.

This type of unstable pattern which depends on the semantic constraint is not limited to the transitive phenomena. For example, the type of Chinese NP predicate defined in  [Li, W. & McFetridge 1995] is also a semantics dependent pattern. Compare:

7a)  zhe           zhang       zhuozi                  san          tiao          tui.
        this           Cl.         table(furniture)      three        Cl.            leg
This table is three-legged.
Note:        Cl for classifier.

7b) *        zhe           zhang       ditu                          san          tiao          tui.
                this           Cl.           map(non-furniture)  three        Cl.            leg

There is clearly a semantic constraint of the NP predicate on its subject: it should be furniture (or animate). Without this “semantic agreement”, Chinese NP is normally not capable of functioning as a predicate, as shown in 7b).

Between semantics dependent and semantics independent patterns, we may have partially dependent patterns. For example, in NP NP V: OSV, it seems that the semantic constraint on the initial object is less important than the semantic constraint on the subject.

8)   shitou                wo              ye   xiang  chi,    kexi      yao       bu      dong.
   stone(non-food)  I(animate) also want  eat,    pity       chew    not      -able

Even stones I also want to eat, but it’s such a pity that I am not able to chew them.

If the constraint on the object matches well, is the subject allowed to be semantically deviant?

9) ?          dianxin                     zhuozi                        chi           le.
                Dim-Sum(food)        table(non-animate)  eat           LE.

Those are the marginal cases, a grammar may choose to be more tolerable to accept it or to be more restrained to reject it.

Unlike SOV, but similar to its English counterpart, OSV is one type of Chinese topic constructions and the relationship between the initial O and V is of long distance dependency.

10a)  dianxin      wo     xiangxin   ni           yiwei        Lisi          chi           le.
          Dim-Sum    I         believe     you          think        Lisi           eat           LE

The Dim Sum I believe you think that Lisi ate.

10b) *      Lisi wo xiangxin ni yiwei dianxin chi le.

10b) will not be accepted in our model because (1) it cannot be interpreted as OSV since it violates the semantic constraint on S: dianxin is not animate; (2) it can neither be interpreted as SOV since it violates the configurational constraint: SOV is simply not of a long distance pattern. In fact, NP NP V: SOV is such a restricted pattern in Chinese that it not only excludes any long distance dependency but even disallows some adjuncts. Compare 11a) in the OSV pattern and 11b) and 11c) in the SOV pattern:

11a)  dianxin      wo           jinjinyouwei             de2           chi           le.
          Dim-Sum      I              with-relish                DE2         eat           LE

The Dim Sum I ate with relish.
Note:        DE2 is a particle introducing a preverbal adjunct of  manner.

11b) *      wo dianxin jinjinyouwei de2 chi le.

11c) *      wo jinjinyouwei de2 dianxin chi le.

There is another pattern of the linear order SOV, the Chinese notorious BA construction. ba is usually regarded as a preposition which introduces a preverbal object for transitive verbs.

NP [ba NP] V: SOV

12a)  wo           ba            dianxin       jinjinyouwei             de2          chi           le.
           I              BA           Dim-Sum     with-relish                DE2         eat           LE

I ate the Dim Sum with relish.

12b)         wo jinjinyouwei de2 ba dianxin  chi le.
With relish, I ate the Dim Sum.

12c)         dianxin  ba wo jinjinyouwei de2  chi le.
The Dim Sum ate me with relish.

12d)         dianxin jinjinyouwei de2 ba wo  chi le.
With relish, the Dim Sum ate me.

For the OSV order, there is another so-called BEI construction. The BEI construction is usually regarded as an explicit passive pattern in Chinese.

NP [bei NP] V: OSV

13a)        dianxin       bei          wo           chi           le.
                Dim-Sum     BEI          I               eat           LE

The Dim Sum was eaten by me.

13b)         wo bei dianxin  chi le.

I was eaten by the Dim Sum.

The BEI construction and the BA construction are both semantics independent. In fact, any pattern resorting to the means of function words in Chinese seems to be sufficiently independent of the semantic constraint.

To conclude, semantic deviation often occurs in some more independent patterns, as seen in 5d2), 6), 8), 12c), 12d), 13b). Close study reveals that different patterns result in different reliance on the semantic constraint, as summarized in the following table.

                syntactic pattern                                 semantic dependence

                NP V NP: SVO                                                    no dependence
                NP [ba NP] V: SOV                                            no dependence
                NP [bei NP] V: OSV                                           no dependence
                NP NP V: OSV                                                    partial dependence
                NP NP V: SOV                                                    full dependence

It should be emphasized that this observation constitutes the rationale behind our approach.

  1. Formulation of lexical rules

Based on the above observation, we have designed a syntax-semantics combined model. In this model, we take a lexical rule approach to Chinese patterns and the related problem of semantic deviation.

A lexical rule takes as its input a lexical entry which satisfies its condition and generates another entry. Lexical rules are usually used to cover lexical redundancy between related patterns. The design of lexical rules is preferred by many grammarians over the more conventional use of syntactic transformation, especially for lexicalist theories.

Our general design is as follows, still using chi (eat) for illustration:

(1)   Syntactically, chi (eat) as a transitive verb subcategorizes for a left NP as its subject and a right NP as its object.

(2)   Semantically, the corresponding notion eat expects an entity of category animate as its logical subject and an entity of category food as its logical object. Therefore the common sense (knowledge) that animate being eats food is represented.

(3)   The interaction of syntax and semantics is implemented by lexical rules. The lexical rules embody the linguistic generalizations about the transitive patterns. They will decide to enforce or waive the semantic constraint based on different patterns.

As seen, syntax only stipulates the requirement of two NPs as complements for chi and does not care about the NPs’ semantic constraint. Semantics sets its own expectation of animate entity and food entity as arguments for eat and does not care what syntactic forms these entities assume on the surface. It is up to lexical rules to coordinate the two. In our model, the information in (1) and (2) is encoded in the corresponding lexical entry and the lexical rules in (3) will then be applied to expand the lexicon before parsing begins. Driven by the expanded lexicon, analysis is implemented by a lexicalist parser to build the interpretation structure for the input sentence. Following this design, there will be sufficient interaction between syntax and semantics as desired while syntax still remains to be a self-contained component from semantics in the lexicon. More importantly, this design does not add any computational complexities to parsing because in order to handle different patterns, the similar lexical rules are also required even for a pure syntax model.

Before we proceed to formulate lexical rules for transitive patterns, we should make sure what a transitive pattern is. As we defined before, a pattern consists of 2 parts: a structure’s syntactic constraint and the corresponding interpretation. Word order is important constraint for Chinese syntax. In addition to word order, we have categories and function words (preposition, particle, etc.). As for interpretation, transitive structure involves 3 elements: V (predicate) and its arguments S (logical subject) and O (logical object). There is a further factor to take into account: Chinese complements are often optional. In many cases, subject and/or object can be omitted either because they can be recovered in the discourse or they are unknown. We call those patterns  elliptical patterns (with some complement(s) omitted), in contrast to full patterns. With these in mind, we can define 10 patterns for Chinese transitive structure: 5 full patterns and 5 elliptical patterns.

We now investigate these transitive patterns one by one and try to informally formulate the corresponding lexical rules to capture them. Please note that the basic input condition is the same with all the lexical rules. This is because they share one same argument structure – transitive structure.

Lexical rule 1:   

                V ((NP1, NP2), (constr1, constr2)) –> NP1 V NP2: SVO

The above notation for the lexical rule should be quite obvious. The input of the rule is a transitive verb which subcategorizes for two NPs: NP1 and NP2 and whose corresponding notion expects two arguments of constr1 and constr2NP is syntactic category, and constr is semantic category (human, animate, food, etc.). The output pattern is in a defined word order SVO and waives the semantic constraint.

Lexical rule 2:   

      V ((NP1, NP2), (constr1, constr2)) –> [NP1, constr1] [NP2, constr2] V: SOV

Please note that the semantic constraint is enforced for this SOV pattern. Since this pattern shares the form NP NP V with the OSV pattern, it would be interesting to see what happens if a transitive verb has the same semantic constraint on both its subject and object. For example, qingjiao (consult) expects a human subject and a human object.

14)           ta                     ni                               qingjiao    guo        me?
                he(human)     you(human)             consult     GUO        ME

Him, have you ever consulted?
Note: GUO is a particle for experience aspect.

15)           ni ta  qingjiao guo  me?

You, has he ever consulted?

In both cases, the interpretation is OSV instead of SOV. Therefore, we need to reformulate Lexical rule 2 to exclude the case when the subject constraint is the same as the object constraint.

Lexical rule 2′ (refined version):

                V ((NP1, NP2), (constr1, constr2), (constr1 not = constr2))

                –> [NP1, constr1] [NP2, constr2] V: SOV

Lexical rule 3:

                V ((NP1, NP2), (constr1, constr2)) –> NP1 [ba NP2] V: SOV

This is the typical BA construction. But not every transitive verb can assume the BA pattern. In fact, ba is one of a set of prepositions to introduce the logical object. There are other more idiosyncratic prepositions (xiang, dao, dui, etc.) required by different verbs to do the same job.

16a)      ni             qingjiao    guo         ta             me?
              you          consult     GUO        he            ME

Have you ever consulted him?

16b)         ni             xiang        ta             qingjiao    guo        me?
                 you          XIANG     he            consult     GUO        ME

Have you ever consulted him?

16c) *      ni             ba            ta             qingjiao    guo        me?
                you          BA           he            consult     GUO        ME

17a)         ta             qu             guo         Beijing.
                 he            go-to        GUO        Beijing

He has been to Beijing.

17b)         ta             dao         Beijing     qu             guo.
                 he            DAO        Beijing     go-to        GUO

He has been to Beijing.

17c) *      ta             ba            Beijing     qu            guo.
                 he            BA           Beijing     go-to        GUO

18a)         ta             hen         titie                             zhangfu.
                 she           very       tenderly-care-for      husband

She cares for her husband very tenderly.

18b)         ta             dui          zhangfu       hen        titie.
                 she           DUI         husband      very       tenderly-care-for

She cares for her husband very tenderly.

18c) *      ta             ba            zhangfu         hen                          titie.
                she           BA           husband         very                         tenderly-care-for

This originates from different theta-roles assumed by different verb notions on their object argument: patient, theme, destination, to name only a few. These theta-roles are further classification of the more general semantic role logical object. We can rely on the subcategorization property of the verb for the choice of the preposition literal (so-called valency preposition). With the valency information in place, we now reformulate Lexical rule 3 to make it more general:

Lexical rule 3′ (refined version):

       V ((NP1, NP2), (constr1, constr2),  (valency_preposition=P), (P not = null))

       –> NP1 [P NP2] V: SOV

Lexical rule 4:   

                V ((NP1, NP2), (constr1, constr2)) –> NP2 … [NP1, constr1] V: OSV

This is a topic pattern of long distance dependency. It is up to different formalisms to provide different approaches to long-distance phenomena. In our present implementation, NP2 is placed in a feature called BIND to indicate the nature of long distance dependency. One phrase structure rule Topic Rule is designed to use this information and handle the unification of the long distance complement properly.

Following the topic pattern, the passive BEI construction is formulated in Lexical rule 5.

Lexical rule 5:   

                V ((NP1, NP2), (constr1, constr2)) –> NP2 [bei NP1] V: OSV

We now turn to elliptical patterns.

Lexical rule 6:   

                V ((NP1, NP2), (constr1, constr2)) –> V NP2: VO

19)           chi           guo          jiaozi                        me?
                eat           GUO        dumpling                 ME

Have (you) ever eaten dumpling?

Lexical rule 7:   

                V ((NP1, NP2), (constr1, constr2)) –> [NP1, constr1] V: SV

20)           wo           chi           le.
                I               eat           LE

I have eaten (it).

21)           ji                                 chi           le.
                chicken1(animate)   eat           LE

The chicken has eaten (it).

Like its English counterpart, ji (chicken) has two senses: (1) chicken1 as animate; (2) chicken2 as food. We code this difference in two lexical entries. Only the first entry matches the semantic constraint on the subject in the pattern and reaches the above SV interpretation in 21). Interestingly enough, the same sentence will get another parse with a different interpretation OV in 23) because the second entry also satisfies the semantic constraint on the object in the OV pattern in Lexical rule 8.

22)           ni             qingjiao    guo         me?
                you          consult     GUO        ME

Have you consulted (someone)?

22) indicates that the SV interpretation is preferred over the OV interpretation when the semantic constraint on the subject and the semantic constraint on the object happen to be the same. Hence the added condition in Lexical rule 8.

Lexical rule 8:   

                V ((NP1, NP2), (constr1, constr2), (constr1 not = constr2))

                –> [NP2, constr2] V: OV

23)           ji                                 chi           le.
                chicken2(food)         eat           LE

The chicken has been eaten.

Lexical rule 9:   

                V ((NP1, NP2), (constr1, constr2)) –> NP2 [bei V]: OV

24)           dianxin    bei           chi           le.
                Dim-Sum  BEI          eat           LE

The Dim Sum has been eaten.

Lexical rule 10:

                V ((NP1, NP2), (constr1, constr2)) –> V: V

25)           chi           le             me?
                eat           LE            ME?                        

(Have you) eaten (it)?

  1. Implementation

We begin with a discussion of some major feature structures in HPSG related to handling the transitive patterns.  Then, we will show how our proposal works and discuss some related implementation issues.

HPSG is a highly lexicalist theory. Most information is housed in the lexicon. The general grammar is kept to minimum: only a few phrase structure rules (called ID Schemata) associated with a couple of principles. The data structure is typed feature structure. The necessary part for a typed feature structure is the type information. A simple feature structure contains only the type information, but a complex feature structure can introduce a set of feature/value pairs in addition to the type information. In a feature/value pair, the value is itself a feature structure (simple or complex). The following is a sample implementation of the lexical entry chi for our Chinese HPSG grammar using the ALE formalism [Carpenter  & Penn 1994].


Note:  (1) Uppercase notation for feature; (2) Lowercase notation for type; (3) Number indices in square brackets for unification.

Leaving the notational details aside, what this roughly says is: (1) for the semantic constraint, the arguments of the notion eat are an animate entity and a food entity; (2) for the syntactic constraint, the complements of the verb chi are 2 NPs: one on the left and the other on the right; (3) the interpretation of the structure is a transitive predicate with a subject and an object. The three corresponding features are: (1) KNOWLEDGE; (2) SUBCAT; (3) CONTENT. KNOWLEDGE stores some of our common sense by capturing the internal relation between concepts. Such common sense knowledge is represented in linguistic ways, i.e. it is represented as a semantic expectation feature, which parallels to the syntactic expectation feature SUBCAT. KNOWLEDGE defines the semantic constraint on the expected arguments no matter what syntactic forms the arguments will take.  In contrast, SUBCAT only defines the syntactic constraint on the expected complements. The syntactic constraint includes word order (LEFT feature), syntactic category (CATEGORY feature) and configurational information (LEX feature).  Finally, CONTENT feature assigns the roles SUBJECT and OBJECT for the represented structure.

A more important issue is the interaction of the three feature structures. Among the three features, only KNOWLEDGE is our add-on. The relationship between SUBCAT and CONTENT has been established in all HPSG versions: SUBCAT resorts to CONTENT for interpretation.  This interaction corresponds to our definition of pattern. Everything goes fine as far as the syntactic constraint alone can decide interpretation. When the semantic constraint (in KNOWLEDGE) has to be involved in the interpretation process, we need a way to access this information. In unification based theories, information flow is realized by unification (i.e. structure sharing, which is represented by the co-index of feature values). In general, we have two ways to ensure structure sharing in the lexicon. It is either directly co-indexed in the lexical entries, or it resorts to lexical rules. The former is unconditional, and the latter is conditional. As argued before, we cannot directly enforce the semantic constraint for every transitive pattern in Chinese, for otherwise our grammar will not allow for any semantic deviation. We are left with lexical rules which we have informally formulated in Section 3 and implemented in the ALE formalism.

CATEGORY is another major feature for a sign. The CATEGORY feature in our implementation includes functional category which can specify functional literal (function word) as its value. Function words belong to closed categories. Therefore, they can be classified by enumeration of literals. Like word order, function words are important form for Chinese syntactic constraint. Grammars for other languages also resort to some functional literals for constraint. In most HPSG grammars for English, for example, a preposition literal is specified in a feature called P_FORM. There are two problems involved there. First, at representation level, there is redundancy: P_FORM:x –> CATEGORY:p (where x is not null). In other words, there exists feature dependency between P_FORM and CATEGORY which is not captured in the formalism. Second, if P_FORM is designed to stipulate a preposition literal, we will ultimately need to add features like CL_FORM for classifier specification, CO_FORM for conjunction specification, etc. In fact, for each functional category, literal specification may be required for constraint in a non-toy grammar. That will make the feature system of the grammar too cumbersome. These problems are solved in our grammar implementation in ALE. One significant mechanism in ALE is its type inheritance and appropriateness specifications for feature structures [Carpenter  & Penn 1994]. (Similar design is found in the new software paradigm of Object Oriented Programming.) Thanks to ALE, we can now use literals (ba, xiang, dao, dui, etc) as well as major categories (n, v, a, p, etc.) to define the CATEGORY feature. In fact, any intermediate level of subclassification between these two extremes, major categories and literals, can all be represented in CATEGORY just as handily. They together constitute a type hierarchy of CATEGORY. The same mechanism can also be applied to semantic categories (human, animate, food, etc.) to capture the thesaurus inference like human –> animate. This makes our knowledge representation much more powerful than in those formalisms without this mechanism. We will address this issue in depth in another paper Typology for syntactic category and semantic category in Chinese grammar.

In the following, we give a brief description on how our grammar works. The grammar consists of several phrase structure rules and a lexicon with lexical entries and lexical rules. First, ALE compiles the grammar into a Prolog parser. During this process (at compile time), lexical rules are applied to lexical entries. In the case of transitive patterns, this means that one entry of chi will evolve into 10 entries. Please note that it is this expanded lexicon that is used for parsing (at run time).

At the level of implementation, we do not need to presuppose an abstract transitive structure as input of the lexical rules and from there generates 10 new entries for each transitive verb. What is needed is one pattern as the basic pattern for transitive structure and derives the other patterns. In fact, we only need 4 lexical rules to derive the other 4 full patterns from 1 basic full pattern. Elliptical patterns can be handled more elegantly by other means than lexical rules.[2]

The basic pattern constitutes the common condition for lexical rules. Although in theory any one of the 5 full patterns can be seen as the basic pattern, the choice is not arbitrarily made. The pattern we chose is the valency preposition pattern (the BA-type construction) NP1 [P NP2] V: SOV (see Lexical rule 3′).[3] This is justified as follows. The valency preposition P (ba, xiang, dao, dui, etc.) is idiosyncratically associated with the individual verb. To derive a more general pattern from a specific pattern is easier than the other way round, for example,  NP1 [P NP2] V: SOV –> NP1 V NP2: SVO is easier than NP1 V NP2: SVO –> NP1 [P NP2] V: SOV. This is because we can then directly code the valency preposition under CATEGORY in the SUBCAT feature and do not have to design a specific feature to store this valency information.


  1. Summery

The ultimate aim for natural language analysis is to reach interpretation, i.e. to assign roles to the constituents. An old question is how syntax (form) and semantics (meaning) interact in this interpretation process. More specifically, which is a more important factor in Chinese analysis, the syntactic constraint or the semantic constraint? For the linguistic data we have investigated, it seems that sometimes syntax plays a decisive role and other times semantics has the final say. The essence is how to adequately handle the interface between syntax and semantics.

In our proposal, the syntactic constraint is seen as a more fundamental factor. It serves as the frame of reference for the semantic constraint. The involvement of the semantic constraint seems to be most naturally conditioned by syntactic patterns. In order to ensure their effective interaction, we accommodate syntax and semantics in one model.  The model is designed to be based on syntax and resorts to semantic information only when necessary. In concrete terms, the system will selectively enforce or waive the semantic constraint, depending on syntactic patterns.

It needs to be advised that there are other factors involved in reaching a correct interpretation. For example, in order to recover the omitted complements in elliptical patterns, information from discourse and pragmatics may be vital. We leave this for future research.



Carpenter, B. & Penn, G. (1994): ALE, The Attribute Logic Engine, User’s Guide, Version 2.0

Gao, Qian (1993): “Chinese BA-Construction: Its Syntax and Semantics”, OSU Working Papers in Linguistics 1993, Kathol A. & Pollard C. (eds.)

Huang, Xiuming (1987): “XTRA: The Design and Implementation of A Fully Automatic Machine Translation System”, Ph.D. dissertation.

Li, Audry (1990): Chapter 6 “Passive, BA, and topic constructions”, Order & Constituency in Mandarin Chinese. Kluwer Academic Publishers

Li, Wei & McFetridge, Paul (1995): “Handling Chinese NP predicate in HPSG”, Proceedings of PACLING-II, Brisbane, Australia

Pollard, Carl  & Sag, Ivan A. (1994): Head-Driven Phrase Structure Grammar,  Centre for the Study of Language and Information, Stanford University, CA

Pollard, Carl  & Sag, Ivan A. (1987): Information-based Syntax and Semantics. Vol. 1: Fundamentals. Centre for the Study of Language and Information, Stanford University, CA

Wilks, Y.A. (1978): “Making Preferences More Active”,  Artificial Intelligence, Vol. 11

Wilks, Y.A. (1975): “A Preferential Pattern-Seeking Semantics for Natural Language Interference”, Artificial Intelligence, Vol. 6


* This research is part of my Ph.D. project on a Chinese HPSG-style grammar, supported by the Science Council of British Columbia, Canada under G.R.E.A.T. award (code: 61). I thank my supervisor Dr. Paul McFetridge for his supervision. He introduced me into the HPSG theory and provided me with his sample grammars. Without his help, I would not have been able to implement the Chinese grammar in a relatively short time. Thanks also go to Prof. Dong Zhen Dong and Dr. Ping Xue for their comments and encouragement.


[1]               The other combinations are:

5d1) *      dianxin chi le wo.              OVS

5d2)         dianxin chi le wo.
The Dim Sum ate me.

Note:        It is OK with the 5d2) reading in the pattern NP V NP: SVO.

5e1) *      chi le wo dianxin.               VSO
5e2)         chi le wo dianxin.

(Somebody) ate my Dim Sum.

Note:        It is OK with the 5e2) reading of in the pattern V [NP1 NP2]: VO where NP1 modifies NP2.

5f1) *      chi le dianxin wo.                 VOS
5f2)         chi le dianxin, wo.

Eaten the Dim Sum, I have.

Note:        It is OK in Spoken Chinese, with a short pause before wo, in a  pattern like V NP, NP: VOS.

[2]   The conventional configurational approach is based on the assumption that complements are obligatory and should be saturated. If saturation of complements were not taken as a precondition for a phrase, serious problems might arise in structural overgeneration. On the other hand, optionality of complement(s) is a real life fact. Elliptical patterns are seen in many languages and especially commonplace in Chinese. In order to ensure obligatoriness of complements, the lexical rule approach can be applied to elliptical patterns, as shown in Section 3. This approach maintains configurational constraint in tree building to block structural overgeneration, but the cost is great: each possible elliptical pattern for a head will have to be accommodated by a new lexical entry. With the type mechanism provided by ALE, we have developed a technique to allow for optionality of complement(s) and still maintain proper configurational constraint. We will address this issue in another paper Configurational constraint in Chinese grammar.

[3]    This choice is coincidental to the base‑generated account of the BA construction in [Li, A. 1990], but that does not mean much. First, our so‑called basic pattern is not their D‑Structure. Second, our choice is based on more practical considerations. Their claim involves more theoretical arguments in the context of the generative grammar.




Handling Chinese NP predicate in HPSG (old paper)

Notes for An HPSG-style Chinese Reversible Grammar

Outline of an HPSG-style Chinese reversible grammar

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

PhD Thesis: Chapter I Introduction

PhD Thesis: Chapter II Role of Grammar

PhD Thesis: Chapter III Design of CPSG95

PhD Thesis: Chapter IV Defining the Chinese Word

PhD Thesis: Chapter V Chinese Separable Verbs

PhD Thesis: Chapter VI Morpho-syntactic Interface Involving Derivation

PhD Thesis: Chapter VII Concluding Remarks

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP

Handling Chinese NP predicate in HPSG (old paper)

Handling Chinese NP predicate in HPSG
(old paper in Proceedings of the Second Conference of the Pacific
Association for Computational Linguistics, Brisbane, 1995)

Wei Li & Paul McFetridge

Department of Linguistics
Simon Fraser University
Burnaby, B.C. CANADA  V5A 1S6


Key words: HPSG; knowledge representation, Chinese processing 



This paper addresses a type of Chinese NP predicate in the framework of HPSG 1994 (Pollard & Sag 1994). The special emphasis is laid on knowledge representation and the interaction of syntax and semantics in natural language processing. A knowledge based HPSG model is designed. This design not only lays a foundation for effectively handling Chinese NP predicate problem, but has theoretical and methodological significance on NLP in general.

In Section 1, the data are analyzed. Both structural and semantic constraints for this pattern are defined. Section 2 discusses the semantic constraints in the wider context of the conceived knowledge-based model. The aim of natural language analysis is to reach interpretations, i.e. correctly assigning semantic roles to the constituents. We indicate that without being able to resort to some common sense knowledge, some structures cannot get interpreted. We present a way on how to organize and utilize knowledge in HPSG lexicon. In Section 3, a lexical rule for this pattern is proposed in our HPSG model for Chinese, whose prototype is being implemented.

  1. Problem

We will show the data of Chinese NP predicate first. Then we will investigate what makes it possible for an NP to behave like a predicate. We will do this by defining both the syntactic and semantic constraints for this Chinese pattern.

1.1. Data: one type of Chinese NP predicate

1) 他好身体。

ta         hao      shenti.
he        good    body
He is of good health.

2)  张三高个子。

Zhangsan         gao      gezi
Zhangsan         tall       figure.
Zhangsan is tall.

3)  李四圆圆的脸。       Lisi

Lisi      yuanyuan         de        lian.
Lisi      round-round    DE       face.
Lisi has a quite round face.

4) 这件大衣红颜色。

zhe       jian      dayi     hong    yanse.
this      (cl.)      coat     red       colour.
This coat is of red colour.

5)  明天小雨。

mingtian          xiao     yu.
tomorrow        little     rain.
Tomorrow it will drizzle.

6)  那张桌子三条腿。

na        zhang   zhuozi san       tiao      tui.
that      (cl.)      table   three    (cl.)      leg
That table is three-legged.

Note:      (cl.) for classifier.
DE for Chinese attribute particle.

The relation between the subject NP and the predicate NP is not identity. The NP predicate in Chinese usually describes a property the subject NP has, corresponding to English be-of/have NP. In identity constructions, the linking verb SHI (be) cannot normally be omitted.[1]

7a)  他是学者。

ta         shi        xuezhe.
he        be        scholar
He is a scholar.

8b) ?他学者。

ta         xuezhe.  他学者。
he        scholar

1.2.  Problem analysis

1.2.1. We first investigate the structural characteristics of the Chinese NP predicate pattern.

A single noun cannot act as predicate. More restrictively, not every NP can become a predicate. It seems that only the NP with the following configuration has this potential: NP [lex -, predicate +].  In other words, a predicate NP consists of a lexical N with a modifying sister. Structures of this sort should not be further modified.[2] Thus, the following patterns are predicted.

8a)      那张桌子三条腿。

na        zhang   zhuozi san       tiao      tui.                   [ same as 6) ]
that      (cl.)      table    three    (cl.)      leg
That table is three-legged.

8b)       那张桌子塑料腿。

na        zhang   zhuozi suliao   tui.
that      (cl.)      table    plastic leg
That table is of plastic legs.

8c) * 那张桌子三条塑料腿。
*    na        zhang   zhuozi san       tiao      suliao   tui.       [too many attributes]

8d) * 那张桌子腿。
*    na        zhang   zhuozi tui.                                           [no attributes]

1.2.2. What is the semantic constraint for the Chinese predicate pattern?

Although there is no syntactic agreement between subject and predicate in Chinese, there is an obvious semantic “agreement” between the two: hao shenti (good body) requires a HUMAN as its subject; san tiao tui (three leg) demands that the subject be FURNITURE or ANIMATE. Therefore, the following are unacceptable:

9) * 这杯茶好身体。

* zhe       bei       cha       hao      shenti.
this      cup      tea       good    body

10) * 空气三条腿。

* kongqi san       tiao      tui.
air        three    (cl.)      leg

Obviously,. it is not hao (good) or san tiao (three) which poses this semantic selection of subject. The semantic restriction comes from the noun shenti (body) or tui (leg). There is an internal POSSESS relationship between them: shenti (body)  belongs to human beings and tui (leg) is one part of an animal or some furniture. This common sense relation is a crucial condition for the successful interpretation of the Chinese NP predicate sentences.

There are a number of issues involved here. First, what is the relationship of this type of knowledge to the syntactic structures and semantic interpretations? Second, where and how would this knowledge be represented? Third, how will the system use the knowledge when it is needed? More specifically, how will the introduction of this knowledge coordinate with the other parts of the well established HPSG formalism? Those are the questions we attempt to answer before we proceed to provide a solution to the Chinese NP predicate. Let us look at some more examples:

11a)     桌子坏了。

zhuozi huai     le.
table    bad      LE
The table went wrong.

11b)     腿坏了。

tui        huai     le.leg       bad      LE
leg       bad      LE
The leg went wrong.

11c)     桌子的腿坏了。

zhuozi  de        tui        huai     le.
table    DE       leg       bad      LE
The table’s leg went wrong.

12a)     他好。

ta         hao.
he        good
He is good.

12b)     身体好。

shenti   hao.
body    good
The health is good.

12c)     他的身体好。

ta         de        shenti   hao.
he        DE       body    good
His health is good.

note: LE for Chinese perfect aspect particle.

When people say 11b) tui huai le (leg went wrong), we know something (the possessor) is omitted. For 11a), however, we have no such feel of incompleteness. Although we may also ask whose table, this possessive relation between who and table is by no means innate. Similarly, ta (he) in 12a) is a complete notion denoting someone while shenti (body) in 12b) is not. In 11c) and 12c), the possessor appears in the possessive structure DE-construction, the expectation of tui (leg) and shenti (body) is realized. These examples show that some words (concepts) have conceptual expectation for some other words (concepts) although the expected words do not necessarily show up in a sentence and the expectation might not be satisfied. In fact, this type of expectation forms part of our knowledge (common sense). One way to represent the knowledge is to encode it with the related word in the lexicon.

Therefore we propose an underlying SYNSEM feature KNOWLEDGE to store some of our common sense knowledge by capturing the internal relation between concepts. KNOWLEDGE parallels to syntactic SUBCAT and semantic RELATION. KNOWLEDGE imposes semantic constraints on their expected arguments no matter what syntactic forms the arguments will take (they may take null form, i.e. the underlying arguments are not realized). In contrast, SUBCAT only defines syntactic requirement for the complements and gets interpreted in RELATION. Following this design, syntactic form and semantic constraints are kept apart. When necessary, the interaction between them can be implemented by lexical rules, or directly coindexed in the lexicon. For example, the following KNOWLEDGE information will be enforced as the necessary semantic constraints when we handle Chinese NP predicates by a lexical rule (see 3.3).

PHON      shenti

  1. Agreement revisited

This section relates semantic constraints which embody common sense to the conventional linguistic notion of agreement. We will show that they are essentially the same thing from different perspectives. We only need slight expansion for the definition of agreement to accommodate some of our basic knowledge. This is important as it accounts for the feasibility of coding knowledge in linguistic ways. Linguistic lexicon seems to be good enough to house some general knowledge in addition to linguistic knowledge. Some possible problems with this knowledge-based approach are also discussed.

Let’s first consider the following two parallel agreement problems in English:

13) *    The boy drink.

14) ?    The air drinks.

13) is ungrammatical because it violates the syntactic agreement between the subject and predicate. 14) is conventionally considered as grammatical although it violates the semantic agreement between the agent and the action. Since the approach taken in this paper is motivated by semantic agreement, some elaboration and comment on agreement seem to be in need.

The agreement in person, gender and number are included in CONTENT | INDEX features (Pollard & Sag 1994, Chapter 2). It follows that any two signs co-indexed naturally agree with each other. That is desirable because co-indexed signs refer to the same entity. However, person, gender and number seem to be only part of the story of agreement. We may expand the INDEX feature to cope with the semantic agreement for handling Chinese and for in-depth semantic analysis for other languages as well.

Note that to accommodate semantic agreement in HPSG, we first need features to represent the result of semantic classification of lexical meanings like HUMAN, FOOD, FURNITURE, etc. We therefore propose a ROGET feature (named after the thesaurus dictionary) and put it into the INDEX feature.

Semantic agreement, termed sometimes as semantic constraint or semantic selection restriction in literature, is not a new conception in natural language processing. Hardly any in-depth language analysis can go smoothly without incorporating it to a certain extent. For languages like Chinese with virtually no inflection, it is more important. We can hardly imagine how the roles can be correctly assigned without the involvement of semantic agreement in the following sentences of the form NP1 NP2 Vt:

15a)     点心我吃了。

dianxin            wo       chi       le.
Dim-Sum         I           eat       LE
The Dim Sum I have eaten.

15b)     我点心吃了。

wo       dianxin            chi       le.
I           Dim-Sum         eat       LE
I have eaten the Dim Sum.

Who eats what?  There is no formal way but to resort to semantic agreement enforced by eat to correctly assign the roles. In HPSG 1994, it was pointed out (Pollard & Sag 1994, p81), “… there is ample independent evidence that verbs specify information about the indices of their subject NPs. Unless verbs ‘had their hands on’ (so to speak) their subjects’ indices, they would be unable to assign semantic roles to their subjects.” The Chinese data show that sometimes verbs need to have their hands on the semantic categories (ROGET) of both their external argument (subject) and internal arguments to be able to correctly assign roles. Now we have expanded the INDEX feature to cover both ROGET and the conventional agreement features number, person and gender, the above claim of Pollard and Sag becomes more general.

It is widely agreed that knowledge is bound to play an important role in natural language analysis and disambiguation. The question is how to build a knowledge-based system which is manageable. Knowledge consists of linguistic knowledge (phonology, morphology, syntax, semantics, etc.) and extra-linguistic knowledge (common sense, professional knowledge, etc.). Since semantics is based on lexical meanings, lexical meanings represent concepts and concepts are linked to each other in a way to form knowledge, we can well regard semantics as a link between linguistics and beyond-linguistics in terms of knowledge. In other words, some extra-linguistic knowledge may be represented in linguistic ways. In fact, lexicon, if properly designed, can be a rich source of knowledge, both linguistic and extra-linguistic. A typical example of how concepts are linked in a network (a sophisticated concept lexicon) is seen in the representation of drink ((*ANI SUBJ) (((FLOW STUFF) OBJE) ((SELF IN) (((*ANI (THRU PART)) TO) (BE CAUSE))))) in Wilks 1975b. While for  various reasons we will not go as far as Wilks, we can gain enlightenment from this type of AI approach to knowledge. Lexicon-driven systems like the one in HPSG can, of course, make use of this possibility. Take the Chinese role-assignment problem, for example, the common sense that ANIMATE being eats FOOD can be seamlessly incorporated in the lexical entry chi (eat) as a semantic agreement requirement.

PHON chi

Note:        Following the convention, the part after the colon is SYNSEM | LOCAL | CONTENT information.

One last point we would like to make in this context is that semantic agreement, like syntactic agreement, should be able to loosen its restriction, in other words, agreement is just a canonical, in Wilk’s term preference, requirement (Wilks 1975a). In practice of communication, deviation in different degrees is often seen and people often relax the preference restriction in order to understand. With semantic agreement, the deliberate deviation is one of the handy means to help render rhetorical expression. In a certain domain, Chomsky’s famous sentence Colorless green ideas sleep furiously is well imaginable. On the other hand, the syntactic agreement deviation will not affect the meaning if no confusion is caused, which may or may not happen depending on context and the structure of the language. In English, lack of syntactic agreement for the present third person singular between subject and predicate usually causes no problem. Sentence 15) The boy drink therefore can be accepted and correctly interpreted. There is much more to say on the interaction of the two types of agreement deviation, how a preference model might be conceived, what computational complexities it may cause and how to handle them effectively. We plan to address it in another paper. The interested reader is referred to one famous approach in this direction. (Wilks 1975a, 1978).


  1. Solution

We will set some requirements first and then present a lexical rule to see how well it meets our requirements.

3.1. Based on the discussion in Section 1, the solution to the Chinese predicate NP problem should meet the following 4 requirements:

(1)        It should enforce the syntactic constraints for this pattern: one and only one modifier XP in the form of NP1 XP NP2.

(2)        It should enforce the semantic constraints for this pattern: N2 must expect NP1 as its POSSESSOR with semantic agreement.

(3)        It should correctly assign roles to the constituents of the pattern: NP1 POSSESS NP2 (where NP2 consists of XP N2).

(4)        It should be implementable in HPSG formalism.


3.2. What mechanisms can we use to tackle a problem in HPSG formalism?

HPSG grammar consists of two components: a general grammar (ID schemata and principles) and a lexical grammar (in the lexicon). The lexicon houses lexical entries with their linguistic description and knowledge representation in feature structures. The lexicon also contains generalizations captured by inheritance of lexical hierarchy and by a set of lexical rules. Roughly speaking, lexical hierarchy covers static redundancy between related potential structures. Just because the lexicon can reflect different degrees of lexical redundancy in addition to idiosyncrasy, the general grammar can desirably be kept to minimum.

The Chinese NP predicate pattern should be treated in the lexicon. There are two arguments for that. First, this pattern covers only restricted phenomena (see 3.4). Second, it relies heavily on the semantic agreement, which in our model is specified in the lexicon by KNOWLEDGE. We need somehow to link the semantic expectation KNOWLEDGE and the syntactic expectation SUBCAT or MOD. The general mechanism to achieve that is structure sharing by coindexing the features either directly in the lexical entries (see the representation of the entry chi in Section 2) or through lexical rules (see 3.3).

3.3. Lexical Rule

Lexical rules are applied to lexical signs (words, not phrases) which satisfy the condition. The result of the application is an expanded lexicon to be used during parsing. Since the pattern is of the form NP1 XP N2, the only possible target is N2, i.e. shenti (body) or tui (leg). This is due to the fact that among the three necessary signs in this form, the first two are phrases and only the final N2 is a lexical sign. We assume the following structure for our proposed lexical rule:

NP[ta[1]]         [[AP[2] hao] [N<NP[1], XP[2]> shenti]]

NP Predicate Lexical Rule




…| CONTENT | RELATION [1] possess

For complicated information flow like this, it is best to explain the indices one by one with regards to the example ta hao shenti (he is of good body) in the form of NP1 XP N2.

The index [1] links the underlying PRED feature of N2 to the semantic RELATION feature; in other words, the predicate in the underlying KNOWLEDGE of shenti (body) now surfaces as the relation for the whole sentence. The index [2] enforces the semantic constraint for this pattern, i.e. shenti (body) expects a human (ROGET) possessor as the subject (EXTERNAL_ARGUMENT) for this sentence. The index [3] is the restriction relation of N2. [4] links the INDEX features of XP and N2, and [6] indicates that the internal argument is a de-facto modifier of N2, i.e. XP mods-for N2. Note that the part of speech of the internal argument (INTERNAL_ARGUMENT | SYNSEM | LOCAL | CATEGORY | HEAD | MAJ) is deliberately not specified in the rule because Chinese modifiers (XP) are not confined to one class, as can be seen in our linguistic data. Finally, [7] defines the restriction relation of the XP to the INDEX of N2.

The indices [4], [7] and [3] all contribute to artificially creating a semantic interpretation for [XP N2]. As is interpreted, XP is, in fact, a modifier of N2 and they would form an NP2, or [XP N2] constituent. In normal circumstances, the building of NP2 interpretation is taken care of by HPSG Semantics Principle. But in this special pattern, we have treated XP as a complement of N2, yet semantically they are still understood as one instance: hao shenti (good body) is an instance of good and body. This interpretation of NP2 serves as POSSESSED of the sentence predicate, indicated by the structure-sharing of [4], [7] and [3]. Finally, [5] is the interpretation of NP1 and is assigned the role of POSSESSOR for the sentence predicate.

Let’s see how well this lexical rule meets the 4 requirements set in 3.1.

(1) It enforces the syntactic constraints by treating XP as the internal argument and NP1 as the external argument.

(2) It enforces the semantic constraints through structure sharing by the index [2].

(3) It correctly assigns roles to the constituents of the pattern.

The following interpretation will be established for ta hao shenti (he is of good body) by the parser.


CONTENT | POSSESSED | INDEX          | NUMBER singular
CONTENT | POSSESSED | RESTRICTION { [ RELATION good],              [ RELATION body  ] }
CONTENT | POSSESSED | RESTRICTION { [ INSTANCE [1] ],              [ INSTANCE [1]  ] }

In prose, it says roughly that a third person male human he possesses something which is an instance of good body. We believe that this is the adequate interpretation for the original sentence.

(4) Last, this rule has been implemented in our Chinese HPSG-style grammar using ALE and Prolog.  The results meet our objective.

But there is one issue we have not touched yet, word order. At first sight, Chinese seems to have similar LP constraints as those in English. For example, the internal argument(s) of a Chinese transitive verb by default appear on the right side of the head. It seems that our formulation contradicts this constraint in grammar. But in fact, there are many other examples with the internal argument(s), especially PP argument(s), appearing on the left side of the head.

服务 fuwu (serve): <NP, PP(wei)>

16a) 为人民服务

wei      renmin fuwu
for       people  serve
Serve the people.

16b) ? 服务为人民。

fuwu    wei      renmin.
serve    for       people

有益 youyi (of benefit): <NP, PP(dui yu)>

17a) 这对我有益。

zhe       dui       wo       youyi
this      to         I           have-benefit
This is of benefit to me.

17b) * 这有益对我。

zhe       youyi               dui       wo
this      have-benefit    to         I

18a) 这于我有益。

zhe       yu        wo       youyi
this      to         I           have-benefit
This is of benefit to me.

18b) 这有益于我。

zhe       youyi               yu        wo
this      have-benefit    to         I
This is of benefit to me.

Word order and its place in grammar are important issues in formulating Chinese grammar. To play safe and avoid generalization too soon, we assume a lexicalized view on Chinese LP constraint, encoding word order information in LEXICON through SUBCAT and MOD features. This proves to be a realistic and precise approach to Chinese word order phenomena.

3.4. As a final note, we will briefly compare the NP Predicate Pattern with one of the Chinese Topic Constructions:

NP1 NP2 Vi/A
(topic + (subject + predicate))

In Chinese, this is a closely related but much more productive form than this NP Predicate Pattern. And their structures are different.

19)       他身体好。

ta         shenti   hao
he        body    good
He is good in health.

For topic constructions, we propose a new feature CONTEXT | TOPIC, whose index in this case is token identical to the INDEX value of ta. Please be advised that in the above structure, the CONTEXT | TOPIC ta is considered as a sentential adjunct instead of a complement subcated-for by shenti. Why? First, ta is highly optional: topic-less sentence is still a sentence. Second, and more convincingly, ta cannot always be predicted by its following noun. Compare:

20a) 他身体好。

ta         shenti   hao
he        body    good
He is good in health.

20b) 他好身体。

ta         hao      shenti
he        good    body
He is of good health.

21a) 他脾气好。

ta         piqi                  hao
he        disposition       good
He is good in disposition.

21b)  他好脾气。

ta         hao      piqi
he        good    disposition
He is of good disposition.


22a) 她学习好。

ta         xuexi   hao. [3]
he        study   good
He is good in study.

22b) *  他好学习。

ta         hao      xuexi
he        good    study

What this shows is that for topic sentences like ta shenti hao (He is good in health), ta xuexi hao (He is good in study), etc., there is no requirement to regard topic ta (he) as a necessary semantic possessor of shenti / xuexi, the relation is rather “in-aspect”: something (NP1) is good (A) in some aspect (NP2), or for something (NP1), some aspect (NP2) is good (A).

Finally, it needs to be mentioned that our proposed lexical rule requires modification to accommodate sentence 6). That is already beyond what we can reach in this paper because it is integrated with the way we handle Chinese classifiers in HPSG framework.



Pollard, Carl  & Sag, Ivan A. (1994): Head-Driven Phrase Structure Grammar,  Centre for the Study of Language and Information, Stanford University, CA

Pollard, Carl & Sag, Ivan A. (1987): Information‑based Syntax and Semantics Vol. 1: Fundamentals. Centre for the Study of Language and Information, Stanford University, CA

Wilks, Y.A. (1975a): A Preferential Pattern-Seeking Semantics for Natural Language Interference.  Artificial Intelligence, Vol. 6, pp.53-74.

Wilks, Y.A. (1975b): An Intelligent Analyzer and Understander of English, in Communications of the ACM, Vol. 18, No.5, pp.264-274

Wilks, Y.A. (1978): Making Preferences More Active.  Artificial Intelligence, Vol. 11, pp. 197-223

~~~~~~~~~~~~~~~ footnotes ~~~~~~~~~~~~~~~~

[1] This is not absolute, we do have the following examples:

Ia)          约翰是纽约人。

Yuehan shi           Niuyue                   ren
John       be            New-York              person
John is a New Yorker.

Ib)           约翰纽约人。

Yuehan  Niuyue                   ren.
John       New-York              person
John is a New Yorker.

IIa)         今天是星期天。

jintian    shi           xingqi-tian.
today     be            Sun-day
Today is Sunday.

IIb)         今天星期天。

jintian    xingqi-tian.
today     Sun-day
Today is Sunday.

It seems to be that the subject NP stands for some individual element(s), and the predicate NP describes a set (property) where the subject belongs. But it is not clear how to capture Ib) and IIb) while excluding 7b). We leave this question open.

[2] We realize that the syntactic constraint defined here is only a rough approximation to the data from syntactic angle. It seems to match most data, but there are exceptions when yi (one) appears in a numeral-classifier phrase:

IIIa)  他一副好身体。

ta            yi             fu            hao         shenti.
he            one         (cl.)         good       body
He is of good health. (He is of a good body.)

IIIb) * 他三副好身体。

ta            san          fu            hao         shenti
he            three       (cl.)         good       body

IIIc)   他好身体。

ta            hao         shenti.    [same as 1) ]

IVa) 李四一张圆圆的脸。

Lisi          yi             zhang     yuanyuan             de            lian.
Lisi          one         (cl.)         round-round         DE          face
Lisi has a quite round face.

IVb) * 李四两张圆圆的脸。

Lisi          liang       zhang     yuanyuan             de            lian.
Lisi          two         (cl.)         round-round         DE          face

IVc)  李四圆圆的脸。

Lisi          yuanyuan             de            lian.        [ same as 3) ]

[3] Another reading for 22a) is [S [Sta xuexi][AP hao]], where ta xuexi is a subject clause: “That he studies is good”. This is another issue.



Interaction of syntax and semantics in parsing Chinese transitive verb patterns 

Notes for An HPSG-style Chinese Reversible Grammar

Outline of an HPSG-style Chinese reversible grammar

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

PhD Thesis: Chapter I Introduction

PhD Thesis: Chapter II Role of Grammar

PhD Thesis: Chapter III Design of CPSG95

PhD Thesis: Chapter IV Defining the Chinese Word

PhD Thesis: Chapter V Chinese Separable Verbs

PhD Thesis: Chapter VI Morpho-syntactic Interface Involving Derivation

PhD Thesis: Chapter VII Concluding Remarks

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP

Notes for An HPSG-style Chinese Reversible Grammar


Key words: Chinese parsing, Chinese generation, reversible grammar,  HPSG

This paper presents a reversible Chinese unification grammar named CPSG. The lexicalized and integrated design of CPSG embodies the general spirit of the modern linguistic theory Head-driven Phrase Structure Grammar (HPSG, Pollard & Sag 1987, 1994). Using ALE formalism in Prolog (Carpenter & Penn 1994), we have implemented a prototype of CPSG.

CPSG covers Chinese morphology, Chinese syntax and semantics in a novel integrated language model (Figure 1, for interface between morphology, see Li 1997; for interface between syntax and semantics, see Li 1996). CPSG model is in sharp contrast to the conventional clear-cut successive design of grammar components (Figure 2, see survey in Feng 1996). We will show that our model is much better suited and more efficient for Chinese analysis (or generation).



Grammar reversibility is a highly desired feature for multi-lingual machine translation application (Hutchins & Somers 1992, Huang 1986, 1987). To test its reversible features, we have applied the CPSG prototype to an experiment of bi-directional machine translation between English and Chinese. The machine translation engine developed in our Natural Language Lab is based on shake-and-bake design, a novel approach to machine translation suited for unification grammars (Whitelock 1992, 1994, Beaven 1992, Brew 1992). The experimental results meet our design objective and verify the feasibility of CPSG approach.


Notes for NWLC-97, UBC, Vancouver

Outline of An HPSG-style Chinese Reversible Grammar

Wei LI   (lio@sfu.ca)

Linguistics Department, Simon Fraser University


 Key words:          lexicalist approach, integrated language model, HPSG,

                                reversible grammar,  bi-directional machine translation, 

                                Chinese computational grammar,

                                Chinese word identification, Chinese parsing,
Chinese generation


  1. background

1.1. design philosophy

Two major obstacles in writing Chinese computational grammar:

lacking in serious study on Chinese lexical base

well designed lexicon is crucial for a successful computational system

theoretical linguists have made fruitful efforts (e.g. Li Linding) but lack formalization

computational linguists require more patience in adapting and formalizing the fruits:

it is huge work, but has to be done if a non-toy system is targeted

lack of effective interaction between morphology, syntax and semantics.


ambiguity in word identification makes it hard to interface morphology & syntax:

a theoretical defect of morphology preprocessor (segmenter)

e.g. ABC: ABC or A | BC or AB | C or A | B | C?

active/passive isomorphic phenomena make semantic constraint a desired need in parsing NP Vt: subject NP or object NP?

Solution: the lexicalized and integrated design of Chinese grammar

1.2. major theoretical foundation:

HPSG:       lexicalist theory encouraging integration of different components

a desired framework matching our design philosophy

CPSG: HPSG-style unification grammar

CPSG: reversible grammar suited for both parsing and generation

CPSG: formalized grammar, a description that does not rely on undefined notions

  1. integrated language model

2.1. CPSG versus conventional Chinese grammar



parse tree embodies both morphological and syntactic structures in CPSG

  1. lexicalized formal grammar

3.1. formalized grammar, as required by a computational grammar: formulation of CPSG

readily implementable (theories, principles, rules, etc.);

precise definition for the very basic notions (e.g. sign, morpheme, word, phrase, sentence, NP, VP, etc.), rules (PS rules and lexical rules), lexical items (lexical hierarchy), typology (hierarchy embodied in feature structures)

(4.)       Definition: sign

A sign is the most fundamental concept of grammar. Formally, a sign is defined by the type [a_sign], which introduces a set of linguistic features for its description, as shown below.

INDEX index
KANJI kanji
MORPH1 expected
MORPH2 expected
CATEGORY category
COMP0 expected
COMP1 expected
COMP2 expected
MOD expected
KNOWLEDGE knowledge
CONTENT content
INDEX0 index
INDEX1 index
INDEX2 index
DTR dtr

(5.)       Definition: word

In CPSG, a word is a sign satisfying the following two conditions: (1) its obligatory morphological expectation has all been saturated; (2) it is not a mother of any syntactic structures, hence no syntactic daughters. Formally, a word is defined as shown below.

(6.)       word

MORPH1 ~obligatory
MORPH2 ~obligatory
DTR no_syn_dtr

3.2. lexicalized grammar

CPSG consists of two parts:

(1) a minimized general grammar:

only 11 phrase structure rules
(covering complement structure, modifier structure,
conjunctive structure and morphological structure)

(2) a feature enriched lexicon:

lexical entries;
lexical hierarchy and a set of lexical rules
(capturing lexical generalizations).


(7.)          comp0 PS rule

MOTHER               a_sign
COMP0 saturated
COMP1 [1]
COMP2 [2]
DTR comp0
LEFTMOD [7] category
RIGHTMOD [8] category
LEFTCOMP [9] category
RIGHTCOMP [10] category


EXPECTING          a_sign
COMP0 a_expected
ROLE [3]
SIGN [4]
COMP1 [1] ~obligatory
COMP2 [2] ~obligatory
DTR dtr

EXPECTED            a_sign [4]
CONTENT content
MYROLE [3] comp_role

PRINCIPLE            #head_feature

(8.)          lexical entry: chi

KANJI one_character
H1 chi
INDEX0 [1] index
INDEX1 [2] index
COMP0 a_expected
SIGN a_sign
COMP1 a_expected
SIGN a_sign
MALE none
U_SUBJECT animate
MALE bin

  1. Implementation and Application of CPSG

CPSG prototype implemented in ALE and Prolog, having parsed a corpus of 200 various types of sentences

ALE and Prolog: suitable for unification grammar
ALE:         mechanism for typed feature structures: type polymorphism
a powerful tool in language modeling

CPSG prototype adapted for application to bi-directional MT, having generated the same corpus of 200 sentences


Beaven, John L. (1992): “Shake and Bake Machine Translation”, Proceedings of the 15th International Conference on Computational Linguistics, pp. 603-609, Nantes, France.

Brew, Chris (1992): “Letting the Cat out of the Bag: Generation for Shake-and-bake MT”, Proceedings of the 15th International Conference on Computational Linguistics, pp. 610-616, Nantes, France.

Carpenter, B. & Penn, G. (1994): ALE, The Attribute Logic Engine, User’s Guide

Feng, Z.  (1996): “COLIPS Lecture Series – Chinese Natural Language Processing”,  Communications of COLIPS, Vol.6, No.1 1996, Singapore (http://www.iscs.nus.sg/~colips/commcolips/paper/p96.html)

Huang, X-M. (1986): “A Bidirectional Grammar for Parsing and Generating Chinese”.  Proceedings of the International Conference on Chinese Computing, Singapore, pp. 46-54

Huang, X-M. (1987): XTRA: The Design and Implementation of A Fully Automatic Machine Translation System, Doctoral dissertation, University of Essex.

Hutchins, W.J. & H.L. Somers (1992): An Introduction to Machine Translation. London, Academic Press.

Li, W. (1996): Interaction of Syntax and Semantics in Parsing Chinese Transitive Patterns. Proceedings of International Conference on Chinese Computing (ICCC’96), Singapore

Li, W. (1997): Chart Parsing Chinese Character Strings. Proceedings of The Ninth North American Conference on Chinese Linguistics (NACCL-9, to be available), Victoria, Canada

Pollard, C.  & I. Sag (1987): Information based Syntax and Semantics Vol. 1: Fundamentals. Centre for the Study of Language  and Information, Stanford University, CA

Pollard, C.  & I. Sag (1994): Head-Driven Phrase Structure Grammar,  Centre for the Study of Language and Information, Stanford University, CA

Whitelock, Pete (1992): “Shake and Bake Translation”, Proceedings of the 14th International Conference on Computational Linguistics, pp. 784-790, Nantes, France.

Whitelock, Pete (1994). “Shake and Bake Translation”, C.J. Rupp, M.A. Rosner, and R.L. Johnson (eds.), Constraints, Language and Computation, pp. 339-359, London, Academic Press.



Outline of an HPSG-style Chinese reversible grammar

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

PhD Thesis: Chapter I Introduction

PhD Thesis: Chapter II Role of Grammar

PhD Thesis: Chapter III Design of CPSG95

PhD Thesis: Chapter IV Defining the Chinese Word

PhD Thesis: Chapter V Chinese Separable Verbs

PhD Thesis: Chapter VI Morpho-syntactic Interface Involving Derivation

PhD Thesis: Chapter VII Concluding Remarks

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP

Outline of an HPSG-style Chinese reversible grammar

 Outline of an HPSG-style Chinese reversible grammar*

Wei  LI
Simon Fraser University

This paper presents the outline and the design philosophy of a lexicalized Chinese unification grammar named W‑CPSG. W‑CPSG covers Chinese morphology, Chinese syntax and semantics in a novel integrated language model. The grammar works reversibly, suited for both parsing and generation. This work is developed in the general spirit of the linguistic theory Head-driven Phrase Structure Grammar (Pollard & Sag 1994). We identify the following two problems as major obstacles in formulating a precise and efficient Chinese grammar. First, we lack in serious study on Chinese lexical base and often jump too soon for linguistic generalization. Second, there is a lack of effective interaction and adequate interface between morphology, syntax and semantics. We address these problems in depth with the lexicalized and integrated design of W‑CPSG. We will also illustrate how W‑CPSG is formalized and how it works.


  1. Background

Unification grammars have been extensively studied in the last decade (Shieber 1986). Implementations of such grammars for English are being used in a wide variety of applications. Attempts also have been made to write Chinese unification grammars (Huang 1986, among others). W‑CPSG (for Wei’s Chinese Phrase Structure Grammar, Li, W. 1997b) is a new endeavor in this direction, with its unique design and characteristics.

1.1. Design philosophy

We identify the following two problems as major obstacles in formulating a precise and efficient Chinese grammar. First, we lack in serious study on Chinese lexical base and often jump too soon for linguistic generalization. Second, there is a lack of effective interaction and adequate interface between morphology, syntax and semantics. We address these problems in depth with the lexicalized and integrated design of W‑CPSG.

1.1.1. Lexicalized design

It has been widely accepted that a well-designed lexicon is crucial for a successful grammar, especially for a natural language computational system. But Chinese linguistics in general and Chinese computational grammars in particular have generally been lacking in in-depth research on Chinese lexical base. For many years, most dictionaries published in China did not even contain information for grammatical categories in the lexical entries (except for a few dictionaries intended for foreign readers learning Chinese). Compared with the sophisticated design and rich linguistic information embodied in English dictionaries like Oxford Advanced Learners’ Dictionary and Longman Dictionary of Contemporary English, Chinese linguistics is hampered by the lack of such reliable lexical resources.

In the last decade, however, Chinese linguists have achieved significant progress in this field. The publication of 800 Words in Contemporary Mandarin (Lü et al., 1980) marked a milestone for Chinese lexical research. This book is full of detailed linguistic description of the most frequently used Chinese words and their collocations. Since then, Chinese linguists have made fruitful efforts, marked by the publication of a series of valency dictionaries (e.g. Meng et al., 1987) and books  (e.g. Li, L. 1986, 1990). But almost all such work was done by linguists with little knowledge of computational linguistics. Their description lacks formalization and consistency. Therefore, Chinese computational linguists require patience in adapting and formalizing these results, making them implementable.

1.1.2. Integrated design

Most conventional grammars assume a successive model of morphology, syntax and semantics. We argue that this design is not adequate for Chinese natural language processing. Instead, an integrated grammar of morphology, syntax and semantics is adopted in W‑CPSG.

Let us first discuss the rationale of integrating morphology and syntax in Chinese grammar. As it stands, a written Chinese sentence is a string of characters (morphemes) with no blanks to mark word boundaries. In conventional systems, there is a procedure-based Chinese morphology preprocessor (so-called segmenter). The major purpose for the segmenter is to identify a string of words to feed syntax. This is not an easy task, due to the possible involvement of the segmentation ambiguity. For example, given a string of 4 Chinese characters da xue sheng huo, the segmentation ambiguity is shown in (1a) and (1b) below.

(1)                    da xue sheng huo

(a)        da-xue                          | sheng-huo
university                    | life

(b)        da-xue-sheng               | huo
university-student       | live

The resolution of the above ambiguity in the morphology preprocessor is a hopeless job because such structural ambiguity is syntactically conditioned. For sentences like da xue sheng huo you qu (university life is interesting), (1a) is the right identification. For sentences like da xue sheng huo bu xia qu le (university students cannot make a living), (1b) is right. So far there are no segmenters which can handle this properly and guarantee correct word segmentation (Feng 1996). In fact, there can never be such segmenters as long as syntax is not brought in. This is a theoretical defect of all Chinese analysis systems in the morphology-before-syntax architecture (Li, W. 1997a). I have solved this problem in our morphology-syntax integrated W‑CPSG (see 2.2. below).

Now we examine the motivation of integrating syntax and semantics in Chinese grammar. It has been observed that, compared with the analysis of Indo-European languages, proper Chinese analysis relies more heavily on semantic information (see, e.g. Chen 1996, Feng 1996). Chinese syntax is not as rigid as languages with inflections. Semantic constraint is called for in both structural and lexical disambiguation as well as in solving the problem of computational complexity.  The integration of syntax and semantics helps establish flexible ways of their interaction in analysis (see 2.3. below).

1.2. Major theoretical foundation: HPSG

The work on W‑CPSG is developed in the spirit of the linguistic theory Head-driven Phrase Structure Grammar (HPSG, proposed by Pollard & Sag, 1987). HPSG is a highly lexicalist theory, which encourages the integration of different components. This matches our design philosophy for implementing our Chinese computational grammar. HPSG serves as a desired framework to start this research with. We benefit most from the general linguistic ideas in HPSG. However, W‑CPSG is not confined to the theory-internal formulations of principles and rules and other details in HPSG versions (e.g. Pollard & Sag 1987, 1994 or later developments). We borrow freely from other theoretical sources or form our own theories in W‑CPSG to meet our goal of Natural Language Processing in general and Chinese computing in particular. For example, treating morphology as an integrated part of parsing and placing it right into grammar is our deliberate choice. In syntax, we formulate our own theory for configuration and word order. Our semantics differs most from any standard version of situation-semantics-based theory in HPSG. It is based on insights from Tesnière’s Dependency Grammar (Tesnière 1959), Fillmore’s Case Grammar (Fillmore 1968) and  Wilks’ Preference Semantics (Wilks 1975, 1978) as well as our own semantic view for knowledge representation and better coordination of syntax-semantics interaction (Li, W. 1996). For these differences and other modifications, it is more accurate to regard W‑CPSG as an HPSG-style Chinese grammar, rather than an (adapted) version of Chinese HPSG.

  1. Integrated language model

2.1. W‑CPSG versus conventional Chinese grammar

The lexicalized design sets the common basis for the organization of the grammar in W‑CPSG. This involves the interfaces of morphology, syntax and semantics.[1]   W‑CPSG assumes an integrated language model of its components (see Figure 1).  The W‑CPSG model is in sharp contrast to the conventional clear-cut successive design of grammar components (see Figure 2).



Figure 2.  conventional language model (non-reversible)

2.2. Interfacing morphology and syntax

As shown in Figure 2 above, conventional  systems take a two-step approach: a procedure-based preprocessor for word identification (without discovering the internal structure) and a grammar for word-based parsing. W‑CPSG takes an alternative one-step approach and the parsing is character- (i.e. morpheme-) based. A morphological PS (phrase structure) rule is designed not only to identify candidate words but to build word‑internal structures as well. In other words, W‑CPSG is a self-contained model, directly accepting the input of a character string for parsing. The parse tree embodies both the morphological analysis and the syntactic analysis, as illustrated by the following sample parsing chart.


Note:    DET for determiner; CLA for classifier; N for noun; DE for particle de;
AF for affix; V for verb; A for adjective; CLAP for classifier phrase;
NP for noun phrase; DEP for DE-phrase

This is so-called bottom-up parsing. It starts with lexicon look-up. Simple edges 1 through 7 are lexical edges. Combined edges are phrasal edges. Each edge represents a sign, i.e. a character (morpheme), a word, a phrase or a sentence. Lexical edges result from a successful match between the signs in the input string and the entries in the lexicon during lexicon look-up. After looking up the lexicon, the lexical information for the signs are made available to the parser. For the sake of concise illustration, we only show two crucial pieces of information for each edge in the chart, namely category and interpretation with a delimiting colon (some function words are only labeled for category). The parser attempts to combine the edges according to PS rules in the grammar until a parse is found. A parse is an edge which ranges over the whole string. The parse ((((1+2)+3)+4)+((5+6)+7)) represents the following binary structural tree embodying both the morphological and syntactic analysis of this NP phrase.


As seen, word identification is no longer a pre-condition for parsing. It becomes a natural by-product of parsing in this integrated grammar of morphology and syntax: a successful parse always embodies the right word identification. For example, the parse ((((1+2)+3)+4)+((5+6)+7)) includes the identification of a word-string zhe (DET) ben (CLA) shu (N) de (DE) ke-du-xing (N). An argument against the conventional separation model is that there exists in the two-step approach a theoretical threshold beyond which the precision for the correct word identification is not possible. This is because proper word identification in Chinese is to a considerable extent syntactically conditioned due to  possible structural ambiguity involved. Our strategy has advantages over the conventional approach  in  resolving word identification ambiguities and in handling the productive word formation. It has solved the problems inherent in the morphology-before-syntax architecture (for detailed argumentation, see Li, W. 1997a).

2.3. Interaction of syntax and semantics

The interface and interaction of syntax and semantics are of vital importance in a Chinese grammar. We are of the same opinion as Chen (1996) and many others that it is more effective to analyze Chinese in an environment where semantic constraints are enforced during the parsing, not after. The argument is based on the linguistic characteristics of Chinese. Chinese has no inflection (like English ‑’s, ‑s, ‑ing, ‑ed, etc.), no such formatives as article (like English a, the), infinitivizer (like English to) and complementizer (like English that). Instead, function words and word order are used as major syntactic devices. But Chinese function words (prepositions, aspect particles, passive particle, plural suffix, conjunctions, etc.) can often be omitted (Lü et al. 1980, p.2). Moreover, fixed word order in order to mark syntactic functions which is usually assumed for isolating languages, is to a considerable extent untrue for Chinese. In fact, there is remarkable freedom or flexibility in Chinese word order. One typical example is demonstrated in the numerous word order variations (although the default order is S‑V‑O subject-verb-object) for the Chinese transitive patterns  (Li, W. 1996).  All these added up project a picture of Chinese as a language of loose syntactic constraint. A weak syntax requires some support beyond syntax to enhance grammaticality. Semantic constraints are therefore called for. I believe that an effective way to model this interaction between syntax and semantics is to integrate the two in one grammar.

One strong piece of evidence for this syntax-semantics integration argument is that Chinese has what I call syntactically crippled structures. These are structures which can hardly be understood on purely formal grounds and are usually judged as ungrammatical unless accompanied with the support from the semantic constraints (i.e. the match of semantic selection restrictions). Some Chinese NP predicate (Li, W. & McFetridge 1995) and transitive patterns like S‑O‑V (Li, W. 1996), among others, are such structures. The NP Predicate is a typical instance of semantic dependence. It is highly undesirable if we assume a general rule like S –> NP1 NP2 in a Chinese grammar to capture such phenomena. This is because there is a semantic condition for NP2 to function as predicate, which makes the Chinese NP predicate a very restricted pattern. For example, in the sentence This table is three-legged: zhe (this) zhang (classifier) zhuo-zi (desk) san (three) tiao (classifier) tui (leg), the subject must be of the semantic type animate or furniture (which can have legs). The general rule with no recourse to semantic constraints is simply too productive and may cause severe computational complexity. In the case of Chinese transitive patterns, formal means are decisive for some variations in their interpretation (i.e. role assignment) process. But others are heavily dependent on semantic constraint. Take chi (eat) as an example. There is no difference in syntactic form in sentences like wo (I) chi (eat) dianxin (Dim-Sum) le (perfect-aspect) and dianxin (Dim-Sum) wo (I) chi (eat) le (perfect-aspect). Who eats what? To properly assign roles to NP1 NP2 V as S-O-V versus O-S-V, the semantic constraint animate eats food needs to be enforced.

The conventional syntax-before-semantics model has now received less popularity in Chinese computing community. Researchers have been exploring various ways of integrating syntax and semantics in Chinese grammar (Chen 1996). In W‑CPSG, the Chinese syntax was enhanced by the incorporation of a semantic constraint mechanism. This mechanism embodies a lexicalized knowledge representation, which parallels to the syntactic representation in the lexicon. I have developed a way to dynamically coordinate the syntactic constraint and semantic constraint in one model. This technique proves to be effective in handling rhetorical expressions and in making the grammar both precise and robust (Li, W 1996).


  1. Lexicalized formal grammar

3.1. Formalized grammar

The application nature of this research requires that we pay equal attention to practical issues of computational systems as well as to a sound theoretical design. All theories and rule formulations in W‑CPSG are implementable. In fact. most of them have been implemented in our prototype W‑CPSG. W‑CPSG is a strictly formalized grammar that does not rely on undefined notions. The whole grammar is represented by typed feature structures (TFS), as defined below based on Carpenter & Penn (1994).

(3)        Definition: typed feature structure 

A typed feature structure is a data structure adopted to model a certain object of a grammar. The necessary part for a typed feature structure is type. Type represents the classification of the feature structure. A simple feature structure contains only the type information, but a complex feature structure can introduce a set of feature-value pairs in addition to the type. A feature-value pair consists of a feature and a value. A feature reflects one aspect of an object. The value describes that aspect. A value is itself a feature structure (simple or complex). A feature determines which type of feature structures it takes as its value. Typed feature structures are finite in a grammar. Their definition constitutes the typology of the grammar.

With this formal device of typed feature structures, we formulate W‑CPSG by defining from the very basic notions (e.g. sign, morpheme, word, phrase, S, NP, VP, etc.) to rules (PS rules and lexical rules), lexical items, lexical hierarchy and typology (hierarchy embodied in feature structures) (Li, W. 1997b). The following sample definitions of some basic notions illustrate the formal nature of W‑CPSG. Please note that they are system-internal definitions and are used in W‑CPSG to serve the purpose of configurational constraints (see Chapter VI of Li, W. 1997b).

(4)        Definition: sign [2]

KANJI kanji
MORPH expected
CATEGORY category
COMP0 expected
COMP1 expected
COMP2 expected
MOD expected
KNOWLEDGE knowledge
CONTENT content
DTR dtr

A sign is the most fundamental concept of grammar. A sign is a dynamic unit of grammatical analysis. It can be a morpheme, a word, a phrase or a sentence. Formally, a sign is defined by the TFS a_sign, which introduces a set of linguistic features for its description, as shown above. These features include the orthographic feature KANJI; morphological feature MORPH; syntactic features CATEGORY, COMP0, COMP1, COMP2, and MOD; structural feature (for both morphology and syntax) DTR; semantic features KNOWLEDGE and CONTENT.

(5)        Definition: morpheme

MORPH ~saturated

A morpheme is a sign whose morphological expectation has not been saturated. In W‑CPSG, ~saturated is equivalent to obligatory/optional/null. For example, the suffix ‑xing (‑ness) is such a morpheme whose morphological expectation for a preceding adjective is obligatory.  In W‑CPSG, a morpheme like ‑xing (‑ness) ceases to be a morpheme when its obligatory expectation, say the adjective ke-du (readable), is saturated. Therefore, the sign ke-du-xing (readability) is not a morpheme, but becomes a word per se.

(6)        Definition: word

MORPH ~obligatory
DTR no_syn_dtr

In W‑CPSG, ~obligatory is equivalent to saturated/optional/null. The specification [MORPH ~obligatory] defines a syntactic sign, i.e. a sign whose obligatory morphological expectation has been saturated. A word is a syntactic sign with no syntactic daughters, i.e. [DTR no_syn_dtr]. Obviously, word with [MORPH saturated/optional/null] overlaps morpheme with [MORPH obligatory/optional/null] in cases when the morphological expectation is optional or null.

Just like the overlapping of morpheme and word, there is also an intersection between word and phrase. Compare the following definition of phrase with the above definition of word.

(7)        Definition: phrase

MORPH ~obligatory
COMP0 ~obligatory
COMP1 ~obligatory
COMP2 ~obligatory 

A phrase is a syntactic sign whose obligatory complement expectation has all been saturated, i.e. [COMP0 ~obligatory, COMP1 ~obligatory, COMP2 ~obligatory]. When a word has only optional complement expectation or no complement expectation, it is also a phrase. The overlapping relationship among morpheme, word and phrase can be shown by the following illustration of the three sets.


S is a syntactic sign satisfying the following 3 conditions: (1) its category is pred (which includes V and A); (2) its comp0 is saturated; (3) its obligatory comp1 and comp2  are saturated.

3.2. Lexicalized grammar

W‑CPSG takes a radical lexicalist approach. We started with individual words in the lexicon and have gradually built up a lexical hierarchy and the grammar prototype.

W‑CPSG consists of two parts: a minimized general grammar and a information-enriched lexicon. The general grammar contains only 11 PS rules, covering complement structure, modifier structure, conjunctive structure and morphological structure. We formulate a PS rule for illustration.


This comp0 PS rule is similar to the rule S ==> NP VP in the conventional phrase structure grammar. The feature COMP0 represents the expectation of the head daughter for its external complement (subject or specifier) on its left side, i.e. [DIRECTION left]. The nature of its expected comp0, NP or other types of sign, is lexically decided by the individual head (hence head-driven or lexicon-driven). It will always be warranted by the general grammar, here via the index [3]. This is the nature of lexicalized grammars. PS rules in such grammars are very abstract. Essentially, they say one thing, namely, 2 signs can combine so long as the lexicon so indicates. The indices [1] and [2] represent configurational constraint. They ensure that internal obligatory complements COMP1 and COMP2 must be saturated before this rule can be applied. Finally, Head Feature Principle (defined elsewhere in the grammar based on the adaptation of the Head Feature Principle in HPSG, Pollard & Sag, 1994) ensures that head features are percolated up from the head daughter to the mother sign.

The lexicon houses lexical entries with their linguistic description and knowledge representation. Potential morphological structures, as well as potential syntactic structures, are lexically encoded (in the feature MORPH for the former and in the features COMP0, COMP1, COMP2, MOD for the latter). Our knowledge representation is also embodied in the lexicon (in the feature KNOWLEDGE). I believe that this is an effective and realistic way of handling natural language phenomena and their disambiguation without having to resort to an encyclopedia-like knowledge base. The following sample formulation of the lexical entry chi (eat) projects a rough picture of what the W‑CPSG lexicon looks like.


The lexicon also contains lexical generalizations. The  generalizations are captured by the inheritance of the lexical hierarchy and by a set of lexical rules. Due to space limitations, I will not show them in this paper.

  1. Implementation and application of W‑CPSG

A substantial Chinese computational grammar has been implemented in the W‑CPSG prototype.  It covers all basic Chinese constructions. Particular attention is paid to the handling of function words and verb patterns.  On the basis of the information- enriched lexicon and the general grammar, the system adequately handles the relationship between linguistic individuality and generality. The grammar formalism which I use to code W‑CPSG is ALE, a grammar compiler on top of Prolog, developed by Carpenter & Penn (1994). ALE  is equipped with an inheritance mechanism on typed feature structures, a powerful tool in grammar modeling. I have made extensive use of the mechanism in the description of lexical categories as well as in knowledge representation. This seems to be an adequate way of capturing the inherent relationship between features in a grammar. Prolog is a programming environment particularly suitable for the development of unification and reversible grammars (Huang 1986, 1987). ALE compiles W‑CPSG into a Chinese parser, a Prolog program ready to accept a string of characters for analysis. In the first experiment, W‑CPSG has parsed a corpus of 200 Chinese sentences of various types.

An important benefit of a unification-based grammar is that the same grammar can be used both for parsing and generation. Grammar reversibility is a highly desired feature for multi-lingual machine translation application. Following this line, I have successfully applied W‑CPSG to the experiment of bi-directional machine translation between English and Chinese. The machine translation system developed in our Natural Language Lab is based on the shake-and-bake design (Whitelock 1992, 1994). I used the same three grammar modules (W‑CPSG, an English grammar and a bilingual transfer lexicon) and the same corpus for the experiment. As part of machine translation output, W‑CPSG has successfully generated the 200 Chinese sentences. The experimental results meet our design objective and verify the feasibility of our approach.




Carpenter, B. & Penn, G. (1994): ALE, The Attribute Logic Engine, User’s Guide

Chen, K-J.  (1996): “Chinese sentence parsing” Tutorial Notes for International Conference on Chinese Computing ICCC’96, Singapore

Feng, Z-W.  (1996): “COLIPS lecture series – Chinese natural language processing”,  Communications of COLIPS, Vol. 6, No. 1 1996, Singapore

Fillmore, C. J. (1968): “The case for case”. Bach and Harms (eds.), Universals in Linguistic Theory. Holt, Reinhart and Winston, pp. 1-88.

Huang, X-M. (1986): “A bidirectional grammar for parsing and generating Chinese”.  Proceedings of the International Conference on Chinese Computing, Singapore, pp. 46-54

Huang, X-M. (1987): XTRA: The Design and Implementation of A Fully Automatic Machine Translation System, Doctoral dissertation, University of Essex.

Li, L-D. (1986): Xiandai Hanyu Juxing (Sentence Patterns in Contemporary Mandarin), Shangwu Yinshuguan, Beijing

Li, L-D. (1990): Xiandai Hanyu Dongci (Verbs in Contemporary Mandarin), Zhongguo Shehui Kexue Chubanshe, Beijing

Li, W. & P. McFetridge (1995): “Handling Chinese NP predicate in HPSG”, Proceedings of PACLING-II, Brisbane, Australia

Li, W. (1996): “Interaction of syntax and semantics in parsing Chinese transitive patterns”, Proceedings of International Conference on Chinese Computing (ICCC’96), Singapore

Li, W. (1997a): “Chart parsing Chinese character strings”, Proceedings of The Ninth North American Conference on Chinese Linguistics (NACCL-9, to be available), Victoria, Canada

Li, W. (1997b): W‑CPSG: A Lexicalized Chinese Unification Grammar, Doctoral dissertation, Simon Fraser University (on-going)

Lü, S-X. et al. (ed.) (1980): Xiandai Hanyu Babai Ci (800 Words in Contemporary Mandarin), Shangwu Yinshuguan, Beijing

Meng, Z., H-D. Zheng, Q-H. Meng, & W-L. Cai (1987): Dongci Yongfa Cidian (Dictionary of Verb Usages), Shanghai Cishu Chubanshe, Shanghai

Pollard, C.  & I. Sag (1987): Information based Syntax and Semantics Vol. 1: Fundamentals. Centre for the Study of Language  and Information, Stanford University, CA

Pollard, C.  & I. Sag (1994): Head-Driven Phrase Structure Grammar,  Centre for the Study of Language and Information, Stanford University, CA

Shieber, S. (1986): An Introduction to Unification-Based Approaches to Grammar. Centre for the Study of Language  and Information, Stanford University, CA

Tesnière, L. (1959): Éléments de Syntaxe Structurale, Paris: Klincksieck

Whitelock, Pete (1992): “Shake and bake translation”, Proceedings of the 14th International Conference on Computational Linguistics, pp. 784-790, Nantes, France.

Whitelock, Pete (1994). “Shake and bake translation”, C.J. Rupp, M.A. Rosner, and R.L. Johnson (eds.), Constraints, Language and Computation, pp. 339-359, London, Academic Press.

Wilks, Y.A. (1975). “A preferential pattern-seeking semantics for natural language interference”.  Artificial Intelligence, Vol. 6, pp. 53-74.

Wilks, Y.A. (1978). “Making preferences more active”.  Artificial Intelligence, Vol. 11,  pp. 197-223



* This project was supported by the Science Council of British Columbia, Canada under G.R.E.A.T. Award (code: 61) and by my industry partner TCC Communications Corporation, British Columbia, Canada. I thank my academic advisors Paul McFetridge and Fred Popowich and my industry advisor John Grayson for their supervision and encouragement. Thanks also go to my colleagues Davide Turcato, James Devlan Nicholson and Olivier Laurens for their help during the implementation of this grammar in our Natural Language Lab. I am also grateful to the editors of the NWLC’97 Proceedings for their comments and corrections.

[1] We leave aside the other components such as discourse, pragmatics, etc. They are an important part of a grammar for a full analysis of language phenomena, but they are beyond what can be addressed in this research.

[2] In formulating W‑CPSG, we use uppercase for feature and lowercase for type; ~ for logical not and / for logical or; number in square brackets for unification.



Outline of An HPSG-style Chinese Reversible Grammar ABSTRACT

PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP

PhD Thesis: Chapter VII Concluding Remarks

This chapter summarizes the research conducted in this dissertation, including its contributions as well as limitation.

7.0. Summary

The goal of this dissertation is to explore effective ways of formally approaching Chinese morpho-syntactic interface in a phrase structure grammar.  This research has led to the following results:  (i) the design of a Chinese grammar, namely CPSG95, which enables flexible coordination and interaction of morphology and syntax;  (ii) the solutions proposed in CPSG95 to a series of long-standing problems at the Chinese morpho-syntactic interface.

CPSG95 was designed in the general framework of HPSG (Pollard and Sag 1987, 1994).  The sign-based mono-stratal design from HPSG demonstrates the advantage in being capable of accommodating and accessing information of different components of a grammar.  One crucial feature of CPSG95 is its introduction of morphology expectation feature structures and the corresponding morphological PS rules into HPSG.  As a result, CPSG95 has been demonstrated to provide a favorable environment for solving morpho-syntactic interface problems.

Three types of morpho-syntactic interface problems have been studied extensively: (i) the segmentation ambiguity in Chinese word identification;  (ii) Chinese separable verbs, a borderline problem between compounding and syntax; and (iii) borderline phenomena between derivation morphology and syntax.

In the context of the CPSG95 design, the segmentation ambiguity is no longer a problem as morphology and syntax are designed system internally in the grammar to support morpho-syntactic parsing based on non-deterministic tokenization (W. Li 1997, 2000).  In other words, the design of CPSG95 itself entails an adequate solution to this long-standing problem, a problem which has been a central topic in Chinese NLP for the last two decades.  This is made possible because the access to a full grammar including both morphology and syntax is available in the integrated process of Chinese parsing and word identification while traditional word segmenters can at best access partial grammar knowledge.[1]

The second problem involves an interesting case between compounding and syntax:  different types of Chinese separable verbs demonstrate various degrees of separability in syntax while all these verbs, when used contiguously, are part of Chinese verb vocabulary.  For each type of separable verbs, arguments were presented for the proposed linguistic analysis and a solution to the problem was then formulated in CPSG95 based on the analysis.  All the proposed solutions provide a way of capturing the link between the separated use and the contiguous use of the separable verb phenomena.  They are shown to be better solutions than previous approaches in the literature which either cannot link the separated use and the contiguous use in the analysis or suffer from being not formal.

The third problem at the interface of derivation and syntax involves two issues: (i) a considerable amount of ‘quasi-affix’ data, and (ii) the intriguing case of zhe-suffixation which demonstrates an unusual combination of a phrase with a bound morpheme.  A generic analysis of Chinese derivation has been proposed in CPSG95.  This analysis has been demonstrated to be also effective in handling both quasi-affixation and zhe-affixation.

7.1. Contributions

The specific contributions are reflected in the study of the following five topics, each constituting a chapter.

On the topic of the Role of Grammar, the investigation leads to the central argument that knowledge from both morphology and syntax is required to properly handle the major types of morpho-syntactic interface problems.  This establishes the foundation for the general design of CPSG95 as consisting of morphology and syntax in one grammar formalism.

An in-depth study has been conducted in the area of the segmentation ambiguity in Chinese word identification.  The most important discovery from the study is that the disambiguation involves the analysis of the entire input string.  This means that the availability of a grammar is key to the solution of this problem.  A natural solution to this problem is the use of grammatical analysis to resolve, and/or prepare the basis for resolving, the segmentation ambiguity.

On the topic of the Design of CPSG95, a mono-stratal Chinese phrase structure grammar has been established in the spirit of the HPSG theory.  Components of a grammar such as morphology, syntax and semantics are all accommodated in distinct features of a sign.  CPSG95 is designed to provide a framework and means for formalizing the analysis of the linguistic problems at the morpho-syntactic interface.

The essential part of this work is the design of expectation feature structures.  Expectation feature structures are generalized from the HPSG feature structures for syntactic subcategorization and modification.  One characteristic of the CPSG95 structural expectation is the design of morphological expectation features to incorporate Chinese productive derivation, which covers a wide range of linguistic phenomena in Chinese word formation.

In order to meet the requirements induced by introducing morphology into the general grammar and by accommodating linguistic characteristics of Chinese, modifications from the standard HPSG are proposed in CPSG95.  The rationale and arguments for these modifications have been presented.  The design of CPSG95 is demonstrated to be a successful application of HPSG in the study of Chinese morpho-syntactic phenomena.

On the topic of Defining the Chinese Word, efforts have been made to reach a better understanding of Chinese wordhood in theory, methodology and formalization.

The theoretical inquiry follows the insight from Di Sciullo and Williams (1987) and Lü (1989).  Two notions of word, namely grammar word and vocabulary word, have been examined and distinguished.  While vocabulary word is easy to define once a lexicon is given, the object for linguistic study and generalization is actually grammar word.  Unfortunately, as there is a considerable amount of borderline phenomena between Chinese morphology and syntax, no precise definition of Chinese grammar word has been available across systems.  Therefore, an argument in favor of the system-internal wordhood definition and interface coordination within a grammar has been made.  This leads to a case-by-case approach to the analysis of specific Chinese morpho-syntactic interface problems.

On the other hand, three useful wordhood judgment methods have also been proposed as a complementary means to the case-by-case analysis.  These methods are (i) syntactic process test involving passivization and topicalization; (ii) keyword based judgment patterns for verbs, and (iii) a general expansion test named X-insertion.  These methods are demonstrated to be fairly operational and easy to apply.

In terms of formalization, a system-internal representation of word has been defined in CPSG95 feature structures.  This definition distinguishes a grammar word from both bound morphemes and syntactic constructions.  The formalization effort is necessary for the rigid study of Chinese morpho-syntactic problems and ensures the implementability of the solutions to these problems as proposed in the dissertation.

On the topic of Chinese Separable Verbs, the task is to coordinate the idiomatic nature of separable verbs and their separated uses in various syntactic patterns.

Since there are different degrees of ‘separability’ for different types of Chinese separable verbs, there is no uniform analysis which can handle all separable verbs properly.  A case-by-case study for each type of separable verbs has been conducted.  An essential part of this study is the arguments for the wordhood judgment for each type.  In the light of this judgment, CPSG95 provides formalized analyses of separable verbs which satisfy two criteria:  (i)  they all capture both structural and semantic aspects of the constructions at issue; (ii) they all provide a way of capturing the link between the separated use and contiguous use.

Finally, on the topic of Morpho-syntactic Interface Involving Derivation, a general approach to Chinese derivation has been proposed.  This approach not only enables us to handle quasi-affix phenomena, but is also flexible enough to provide an adequate treatment of the special problem in zhe-suffixation.

In the CPSG95 analysis, the affix serves as head of a derivative and can impose various constraints in the lexicon on its expected stem sign for the morphological expectation.  Coupled with only two PS rules formulated in the general grammar (Prefix PS Rule and Suffix PS Rule), it has been shown that various Chinese affixation phenomena can be captured equally well.  The PS rules ensure that all the lexical constraints be observed before the affix and the stem combine and that the output of derivation be a word.

As for the quasi-affixation problem, based on the observation that there is no fundamental structural difference between quasi-affixation and other affixation, a proper treatment of ‘quasi-affixes’ can be established in the same way as other affixes are handled in CPSG95; the individual difference in semantics is shown to be capturable in the lexicon.

The study of zhe-suffixation started with arguments for its analysis of VP+-zhe.  This is an unsolvable problem in any system which enforces sequential processing of morphology before syntax.  The solution which CPSG95 offers demonstrates the power of designing derivation morphology and syntax in a mono-stratal grammar.   With this novel design in modeling Chinese grammar, the CPSG95 general approach to derivation readily applies to the tough case of zhe-suffixation.  This is possible because of the ability of an affix in placing any lexicalized constraints, VP in this case, on the expected stem for morphological expectation.  In addition, the proposed lexicalized solution also captures the building of the semantic content for this morpho-syntactic borderline phenomenon.

7.2. Limitation

The major limitation of the work reported in this thesis lies in the following two aspects.

Limited by space, the thesis has only presented some sample formulation of typical affixes and quasi-affixes to demonstrate the proposed general approach to Chinese derivation morphology.  As many affixes/quasi-affixes have their distinctive semantic property, a reader who likes to experiment with this proposal in implementation still has to work out the technical details for each affix.  However, it is believed that the general strategy has been presented in sufficient details to allow for easy accommodation of individual aspects of an affix which have not been specifically addressed in the thesis.

Limited by the focus on a handful of major morpho-syntactic interface problems, the treatment of reduplication and unlisted proper names have not been listed as special topics for in-depth exploration.  They are only briefly discussed in Chapter II (Section 2.2) as cases of productive word formation for the need to involve syntax when they involve segmentation ambiguity at the boundaries.  However, they are also long-standing word identification problems which affect morpho-syntactic interface when the segmentation ambiguity is involved.  In particular, it is felt that the treatment of transliterated foreign names requires further research before a satisfactory solution can be found in the framework of CPSG95.[2]

7.3. Final Notes

This last section is used to place the research reported in this thesis in a larger context.

Chinese NLP has reached a new stage marked by the publication of Guo’s series of papers on Chinese tokenization (Guo 1997a,b,c,d, Guo 1998).  There are signs that the major research focus is being shifted from word segmentation to the grammar design and development.  In this process,  the morph-syntactic interface will remain a hot topic for quite some time to come.  The work on CPSG95 can be seen as one of the efforts in this direction.

The design of CPSG95, a formal grammar capable of representing both morphology and syntax in a uniform formalism, is one successful application of the modern linguistic theory HPSG in the area of Chinese  morpho-syntactic interface research.  However, this is by no means to claim that CPSG95 is the only or best framework to capture the morpho-syntactic problems.   This is only one approach which has been shown to be feasible and effective.  Other equally good or better approaches may exist.

In terms of future directions, constraints from semantics and discourse should be made available in the grammatical analysis.  In Chapter II (Section 2.4), we have seen problems whose ultimate solutions depend on the access to the semantic or discourse constraints.  It is believed that the sign-based mono-stratal design of CPSG95 will be extensible to accommodate these constraints.  However, this will require years of future research before they can be formally modeled and properly introduced into the grammar.



[1] As a matter of fact, the CPSG95 experiment shows that most segmentation ambiguity is resolved automatically as a by-product of morpho-syntactic parsing and the remaining ambiguity is embodied in the multiple syntactic trees as the results of the analysis.

[2] However, in the CPSG95 implementation, the problem of handling the Chinese person names, a special case of compounding, has been solved fairly satisfactorily.  The proposal is to use the surname as the head sign to expect the given name (of one or two characters) on its right to form potential full names.  As the right boundary of a person name is difficult to define without the support of sentential analysis, the conventional word segmenter frequently makes wrong segmentation in such cases.  In contrast, the approach implemented in CPSG95 is free from this problem because whether a potential name proposed by the surname ultimately survive as a proper name is decided by whether it contributes to a valid parse for the processed sentence.  In last few years, there has been rapid progress on proper name identification in the area of information extraction, called named entity tagging (MUC7 1998; Chen et al 1997).



Bauer, Laurie (1988).  Introducing Linguistic Morphology.  Edinburgh:  Edinburgh University Press.

Bloomfield, Leonard (1933). Language, New York: Henry Holt & Co.

Borsley, Robert (1987).  Subjects and Complements in HPSG.   Technical report no. CSLI-107-87.  Stanford:  Center for the Study of Language and Information.

Carpenter, B. and G. Penn (1994).  ALE, The Attribute Logic Engine, User’s Guide.  From http://www.sfs.nphil.uni-tuebingen.de/~gpenn/ale.html (accessed January 30, 2001).

Chao, Yuen-Ren (1968).  A Grammar of Spoken Chinese.  Berkeley:  University of California Press.

Chen, H.-H et al (1997).  Description of the NTU System used for MET-2.  Proceedings of MUC-7.  From http://perso.enst.fr/~monnier/lectures/IE/MUC7/muc_7_toc.html (accessed January 30, 2001).

Chen, K. and S. Liu (1992).  Word Identification for Mandarin Chinese Sentences.  Proceedings of 14th International Conference on Computational Linguistics (COLING’92). Nantes, France, 101-107.

Chen, M.Y. and W. S-Y. Wang (1975).  Sound Change:  Actuation and Implementation.  Language 51:2, 255-281.

Chen, Ping (1994).  “Shilun Hanyu zhong San Zhong Juzi Chengfen yu Yuyi Cheng Fen de Peiwei Yuanze” (On Mapping Principles of Relationship between Chinese Three Syntactic Constituents and Semantic Roles). Zhongguo Yuwen (Chinese Linguistics), No.3.

Chomsky, Noam (1970).  Remarks on Nominalization.  Readings in English Transformational Grammar, eds. by R. Jacobs and P. Rosenbaum, Waltham, Massachasetts:  Ginn and Company, 184-221.

Dai, John Xiang-ling (1993).  Chinese Morphology and its Interface with Syntax.  Ph.D. Dissertation, Ohio State University.

DeFrancis, John (1984).  The ChineseLanguage: Fact and Fantasy.  Honolulu:  University of Hawaii Press.

Di Sciullo, A.M. and E. Williams (1987).  On The Definition of Word.  The MIT Press, Cambridge, Massachusetts.

Ding, Shengshu (1953). “Hanyu Yufa Jianghua” (Lectures of Chinese Grammar), Zhongguo Yuwen (Chinese Linguistics), No. 3 and No. 4.

Dowty, D. (1982).  More on the Categorial Analysis of Grammatical Relations.  In A. Zaenen (Ed.), Subjects and Other Subjects:  Proceedings of the Harvard Conference on Grammatical Relations.  Bloomington:  Indiana University Linguistics Club.

Feng, Zhiwei (1996).  COLIPS Lecture Series – Chinese Natural Language Processing,  Communications of COLIPS, Vol.6, No.1, Singapore.

Gan, Kok Wee (1995).  Integrating Word Boundary Disambiguation with Sentence Understanding, Ph.D. Dissertation, National University of Singapore.

Gazdar, G., E. Klein, G.K. Pullum, and I.A. Sag (1985).  Generalized Phrase Structure Grammar.  Cambridge: Blackwell, and Cambridge, Mass.:  Harvard University Press.

Guo, Jin (1997a).  Critical tokenization and its properties.  Computational Linguistics, Vo. 23, No.4, 569-596.

Guo, Jin (1997b).  Chinese Language Modeling for Speech Recognition.  Ph.D. dissertation, Institute of Systems Science, National University of Singapore.

Guo, Jin (1997c).  A Comparative Study on Sentence Tokenization Generation Schemes.  In review for journal publication from http://sunzi.iss.nus.sg:1996/guojin/papers/ (accessed March 25, 1999).

Guo, Jin (1998).  One tokenization per source.  Proceedings of the 17th International Conference on Computational Linguistics and 36th Annual Meeting of the Association for Computational Linguistics (COLING-ACL ’98),  Montreal, Canada, 457-463.

He, K., H. Xu and B. Sun (1991).  Design Principles of an Expert System for Automatic Word Segmentation of Written Chinese Texts, Journal of Chinese Information Processing, Vol. 5, No. 2, 1-14.

Hockett, C.F. (1958).  A Course in Modern Linguistics.  New York:  Macmillan.

Hu, F. and L. Wen (1954).  “Ci de fanwei, xingtai, gongneng” (Scope, form and function of word). Zhongguo Yuwen (Chinese Linguistics), August issue.

Jackendoff, Ray (1972). Semantic Interpretation In Generative Grammar, Cambridge, Massachusetts:  MIT Press.

Jensen, John T. (1990).  Morphology:  Word Structure in Generative Grammar.  Amsterdam/Philadephia:  John Benjamins Publishing Company.

Kathol, Andreas (1999).  Agreement and the Syntax-Morphology Interface in HPSG. In Robert Levine and Georgia Green (eds.) Studies in Current Phrase Structure Grammar. Cambridge University Press, 223-274.

Kolman, B. and R.C. Busby (1987). Discrete Mathematical Structures for Computer Science, 2nd edition. Prentice-Hall, Inc.

Krieger, Hans-Ulrich (1994). Derivation without Lexical Rules,  in C.J Rupp, M. Rosner and R. Johnson (eds), Constraints, Language, and Computation.  Academic Press, 277-313.

Li, C.N. and  S.A. Thompson (1981).  Mandarin Chinese:  A Functional Grammar.  Berkeley:  University of California Press.

Li, Linding (1986).  Xiandai Hanyu Juxing (Sentence Patterns in Contemporary Mandarin), Shangwu Yinshuguan (Commercial Press), Beijing.

Li, Linding (1990).  Xiandai Hanyu Dongci (Verbs in Contemporary Mandarin), Zhongguo Shehui Kexue Chubanshe, Beijing.

Li, Qinghua (1983).  “Tan liheci de tedian he yongfa” (On the characteristics and usages of separable words).  Yuyan Jiaoxue He Yan Jiu (Language Instruction and Research), No.3.

Li, Wei (1996).  Interaction of Syntax and Semantics in Parsing Chinese Transitive Patterns.  Proceedings of International Conference on Chinese Computing (ICCC’96), Singapore.

Li, Wei (1997).  Chart Parsing Chinese Character Strings.  Proceedings of the Ninth North American Conference on Chinese Linguistics (NACCL-9), Victoria, Canada.

Li, Wei (2000). On Chinese parsing without using a separate word segmenter.  Communication of COLIPS 10 (1): 19-68.

Liang, Nanyuan (1987).  CDWS — A Written Chinese Automatic Word Segmentation System.  Journal of Chinese Information Processing, 1(2): 44-52.

Lieber, R. (1992).  Deconstructing Morphology. Chicago: University of Chicago Press.

Lin, Handa (1983).  “Shime shi ci – xiaoyu ci de bu shi ci” (What is a word – a unit smaller than a word is not a word). Zhongguo Yuwen (Chinese Linguistics), No.34.

Lu, Jianming (1988).  “Mingci-xing ‘laixin’ shi ci haishi cizu” (Nominal laixin: word or word group).  Zhongguo Yuwen (Chinese Linguistics), No. 5.

Lu, Zhiwei (1957).  Hanyu de Goucifa (Chinese Word Formation), Kexue Chubanshe (Science Publishing House)..

Lü, Shuxiang. (1946). “Cong Zhuyu, Binyu de Fenbie Tan Guoyu Juzi de Fenxi” (On Sentence Analysis of Mandarin Chinese from the Angle of the Distinction between Subject and Object),  Kaiming Shudian Er Shi Zhounian Jiannian Wenji (Selected Works to Celebrate the 20th Anniversary of Kaiming Bookstore).

Lü, Shuxinag et al (ed.) (1980).  Xiandai Hanyu Babai Ci (800 Words in Contemporary Mandarin), Shangwu Yinshuguan (Commercial Press), Beijing.

Lü, Shuxiang (1989). “Hanyu Yufa Fenxi Wenti” (Issues on Chinese grammatical analysis),  Lü Shuxiang Zixuanji (Self-selected Works of Shuxiang Lü), Shang Hai Jiaoyu Chubanshe (Shanghai Education Publishing House), Shanghai, 93-180.

Lua, Kim Teng (1994).  Application of Information Theory Binding in Word Segmentation. Computer Processing of Chinese and Oriental Languages 8(1): 115-124.

Lyons, John (1968).  Introduction to Theoretical Linguistics.  Cambridge:  Cambridge University Press.

MUC-7 (1998).  Proceedings of the Seventh Message Understanding Conference (MUC-7).  From http://perso.enst.fr/~monnier/lectures/IE/MUC7/muc_7_toc.html (accessed January 30, 2001).

Pollard, C. and I. Sag (1987).  Information based Syntax and Semantics Vol. 1: Fundamentals.  Centre for the Study of Language  and Information, Stanford University, CA.

Pollard, C. and I. Sag (1994).  Head-Driven Phrase Structure Grammar.  The University of Chicago Press.

Riehemann, Susanne (1993). Word Formation in Lexical Type Hierarchies – A Case Study of bar-Adjectives in German. SfS-Report-02-93, University of Tübingen.

Riehemann, Susanne (1998). Type-based derivational morphology.  Journal of Comparative Germanic Linguistics 2. 49-77.

Sapir, Edward (1921).  Language:  Introduction to the Study of Speech.  NewYork:  Harcourt, Brace, and World.

Selkirk, E. (1982).  The Syntax of Words.  Cambridge:  MIT Press.

Shi, Youwei (1992).  Huhuan Rouxing – Hanyu Yufa Tanyi (A Call for Flexibility – Peculiarities of Chinese Grammar), Hunan Publishing House.

Shieber, S. (1986).  An Introduction to Unification-Based Approaches to Grammar.  Centre for the Study of Language  and Information, Stanford University, CA.

Sproat, R., C. Shih, V. Gale, and N. Chang (1996).  A Stochastic Finite-State Word-Segmentation Algorithm for Chinese.  Computational Linguistics. Vol. 22, No. 3.

Sun, L. and P. Cole (1991).  The effect of morphology on long-distance reflexives.  Journal of Chinese Linguistics 19:1, 42-62.

Sun, M. and B. T’sou (1995).  Ambiguity resolution in Chinese word segmentation.  Proceedings of the 10th Pacific Asia Conference on Language, Information and Computation (PACLIC-95), Hong Kong, 121-126.

Sun, M. and C. Huang (1996).  Word Segmentation and Part-of-Speech Tagging for Unrestricted Chinese Texts, A Tutorial at the 1996 International Conference on Chinese Computing (ICCC96), Singapore.

Thompson, S.A. (1973).  Resultative Verb Compounds in Mandarin Chinese:  A Case of Lexical Rules. Language 49:2, 361-379.

Wang, Li (1955).  ZhongguoYufa Lilun (Chinese Grammatical Theory), Zhonghua Shuju, Shanghai.

Wang, Xiaolong (1989).  Automatic Chinese Word Segmentation, in Word Separating and Mutual Translation of Syllable and Character Strings, Ph.D. Dissertation, Dept. of Computer Science and Engineering, Harbin Institute of Technology.

Webster, J. J. and C-Y Kit. (1992).  Tokenization as the Initial Phase in NLP.  Proceedings of the 14th International Conference on Computational Linguistics (COLING-92).  Nantes, France, 1106-1110.

Wu, A. and Z. Jiang (1998).  Word Segmentation in Sentence Analysis.  Proceedings of the 1998 International Conference on Chinese Information Processing.  Beijing, China, 169-180.

Wu, Dekai (1998).  A Position Statement on Chinese Segmentation.  Presented at the Chinese Language Processing Workshop, University of Pennsylvania. (Current draft at http://www.cs.ust.hk/~dekai/papers/segmentation.html, accessed January 30, 2001).

Wu, M. and K. Su (1993).  Corpus-Based Automatic Compound Extraction with Mutual Information and Relative Frequency Count.  Proceedings of R.O.C. Computational Linguistics Conference (ROCLING) VI, Taiwan, 207-216.

Xue, Ping (1991).  Syntactic Dependencies in Chinese and their Theoretical Implications.  Ph.D. dissertation, University of Victoria, Canada.

Yao, T., G. Zhang, and Y. Wu (1990).  A Rule-Based Chinese Automatic Segmentation System.  Journal of Chinese Information Processing 4(1): 37-43.

Yeh, C-L. and H-J. Lee (1991).  Rule-Based Word Identification For Mandarin Chinese Sentences — A Unification Approach.  Computer Processing of Chinese and Oriental Languages. Vol. 5, No. 2, 97-118.

Yu, Shihong et al (1997).  Description of the Kent Ridge Digital Labs System Used for MUC-7.  Proceedings of MUC-7.  From http://perso.enst.fr/~monnier/lectures/IE/MUC7/muc_7_toc.html (accessed January 30, 2001).

Zhang, J., Z. Chen and S. Chen (1991).  A Method of Word Identification for Chinese by Constraint Satisfaction and Statistical Optimization Techniques.  Proceedings of R.O.C. Computational Linguistics Conference (ROCLING) IV, Taiwan, 147-165.

Zhang, Shoukang (1957).  “Lüetan hanyu goucifa” (A brief discussion on Chinese word formation)  Xiandai Hanyu Cankao Ziliao (Reference for Comtemporary Chinese),  ed. by Yushu Hu (1981),  Shanghai:  Shanghai Jiaoyu Chubanshe (Shanghai Education Publishing Company), 241-256.

Zhao, S. and B. Zhang (1996).  “Liheci de queding yu liheci de xingzhi” (Determination and characteristics of separable words).  Yuyan Jiaoxue he Yanjiu (Language Instruction and Research), No.1, 40-51.

Zhu, Dexi (1985).  Yufa Wenda (Questions and Answers on Chinese Grammar).  Shangwu Yinshuguan (Commercial Press), Beijing.

Zwicky, A.M. (1987). Slashes in the Passive.  Linguistics 25, 639-669.

Zwicky, A.M. (1989).  Idioms and Constructions.  Eastern States Conference on Linguistics 5, 547-558.



PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

PhD Thesis: Chapter I Introduction

PhD Thesis: Chapter II Role of Grammar

PhD Thesis: Chapter III Design of CPSG95

PhD Thesis: Chapter IV Defining the Chinese Word

PhD Thesis: Chapter V Chinese Separable Verbs

PhD Thesis: Chapter VI Morpho-syntactic Interface Involving Derivation

PhD Thesis: Chapter VII Concluding Remarks

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP

PhD Thesis: Chapter VI Morpho-syntactic Interface Involving Derivation

6.0. Introduction

This chapter studies some challenging problems of Chinese derivation and its interface with syntax.  These problems have been a challenge to existing word segmenters; they are also long-standing problems for Chinese grammar research.

It is observed that a good number of signs have become more and more like affixes as the Chinese language develops.  Typical, indisputable examples include signs like the nominalizer 性 ‑xing (-ness) and the prefix 第 di- (-th).  While few people doubt the existence of affixes in Contemporary Chinese, there is no general agreement on the exact number of Chinese affixes, due to a considerable number of borderline cases often referred to as ‘quasi-affixes’ (类语缀 lei yu-zhui).[1]  It will be argued that the quasi-affixes belong to morphology and are structurally not different from other affixes.  The major difference between ‘quasi-affixes’ and the few generally honored (‘genuine’) affixes lies mainly in the following aspect.  The former retain some ‘solid’ meaning while the latter are more functionalized.  However, this does not prevent CPSG95 from providing a proper treatment of quasi-affixes in the same way as it handles other affixes.  It will be shown that the difference in semantics between affixes or quasi-affixes can be accommodated fairly easily in the CPSG95 lexicon.

Based on the examination of the common property of Chinese affixes and quasi-affixes, a general approach to Chinese derivation is proposed.  This approach not only enables us to handle quasi-affix phenomena, but is also flexible enough to provide an adequate treatment of a special problem in Chinese derivation, namely zhe-suffixation.  The affix status of 者 -zhe (-er) is generally acknowledged (classified as suffix in the authoritative books like Lü et al 1980):  it attaches to a verb sign and produces a word.  The peculiar aspect of this suffix is that the verb stem which it attaches to can be syntactically expanded.  In fact, there is significant amount of evidence for the argument that this suffix expects a VP as its stem (see 6.5 for evidence).   Since a VP is only formed in syntax and derivation is within the domain of morphology, this phenomenon presents a highly challenging case on how morphology should be interfaced properly to syntax.  The solution which is offered in CPSG95 demonstrates the power of designing morphology and syntax in an integrated grammar formalism.  In contrast, in any system which enforces sequential processing of derivation morphology before syntax – most traditional systems assume this, this is an unsolvable problem.  There does not seem to be a way of enabling partial output of syntactic analysis (i.e. VP) to feed back to some derivation rule in the preprocessing stage.

In Section 6.1, the general approach to Chinese derivation is proposed first.  Following this proposal, prefixation is illustrated in 6.2 and suffixation in 6.3.  Section 6.4 shows that this general approach to derivation applies equally well to the ‘quasi-affix’ phenomena.  Section 6.5 investigates the suffixation of -zhe (-er).  The analysis is based on the argument that this suffixation involves the combination VP+-zhe.  The specific solution following the CPSG95 general approach will be presented based on this analysis.

6.1. General Approach to Derivation

This section examines the property of Chinese affixes and proposes a corresponding general approach to Chinese derivation.  This serves as the basis for the specific solutions to be presented in the remaining sections to various problems in Chinese derivation.

It is fairly easy to observe that in Chinese derivation it is the affix which selects the stem, not the other way round.  For example, the suffix 性 -xing (‑ness) expects an adjective to produce an (abstract) noun.   Based on the examination of the behavior of a variety of Chinese affixes or quasi-affixes, the following generalization has been reached.  That is, an affix lexically expects a sign of category x, with possible additional constraints, to form a derived word of category y.   This generalization is believed to capture the common property shared by Chinese affixes/quasi-affixes.  It seems to account for all Chinese derivational data, including typical affixation, quasi-affixation (see 6.4) and the special case of zhe-suffixation (see 6.5).  So far no counter evidence has been found to challenge this generalization.

The observation and the generalization above support the argument that in a grammar which relies on lexicalized expectation feature structures to drive the building of structures, affixes, not the stems, should be selecting heads of the morphological structures.[2]   Leaving aside the non-productive affixation,[3] the general strategy to Chinese productive derivation is proposed as follows.  In the lexicon, the affix as head of derivative is encoded with the following derivation information:  (i) what type of stem (constraints) it expects;  (ii) where to look for the expected stem, on its right or left;  (iii) what type of (derived) word it leads to (category, semantics, etc.).  Based on this lexical information, CPSG95 has two PS rules in the general grammar for derivation:  one for prefixation, one for suffixation.[4]  These rules ensure that all the constraints be observed before an affix and a stem are combined.  They also determine that the output of derivation, i.e. the mother sign, be a word.

Along this line, the key to a lexicalized treatment of Chinese derivation is to determine the structural and semantic property of the derivative and to impose proper constraints on the expected stem.  The constraints on the expected stem can be lexically specified in the morphological expectation feature [PREFIXING] or [SUFFIXING] of the affix.  The property (category, syntactic expectation, semantics, etc.) of the derivative can also be encoded directly in the lexical entry of the affix, seen as the head of a derivational structure in the CPSG95 analysis.  This property information, as part of head features, will be percolated up when the derivation rules are applied.

In the remaining part of this chapter, it will be demonstrated how this proposed general approach is applied to each specific derivation problem.

6.2. Prefixation

The purpose of this section is to present the CPSG95 solution to Chinese prefixation.  This is done by formulating a sample lexical entry for the ordinal prefix 第 di- (-th) in CPSG95.  It will be shown how the lexical information drives the prefix rule in the general grammar for the derivational combination.

Thanks to the productivity of the prefix 第 di- (-th), the ordinal numeral is always a derived word from the cardinal numeral via the following rule, informally formulated in (6-1).

(6-1.) 第 di- + cardinal numeral –> ordinal numeral

di-      22      tiao    jun-gui
-th     22      CLA   military-rule
the 22-nd military rule (Catch-22)

di-      ba      ge      shi     tong-xiang
-th     eight  CLA   be      bronze-statue
The eighth is the bronze statue.

The basic function of the Chinese numeral, whether cardinal or ordinal,  is to combine with a classifier, as shown in the sample sentences above.

To capture this phenomenon, CPSG95 defines two subtypes for the category numeral [num], namely the [cardinal_num] and [ordinal_num].   The lexical entries of the prefix 第 di‑ (‑th) and the cardinal numeral 五 wu (five) are formulated in (6-2) and (6-3).  The prefix encodes the lexical expectation for the derivation 第 di- + [cardinal_num] ‑‑> [ordinal_num] plus the semantic composition of the combination.  Note that the constraint @numeral inherits all common property specified for the numeral macro.


As indicated before, prefixation in CPSG95 is handled by the Prefix PS Rule based on the lexical specification.  More specifically, it is driven by the lexical expectation encoded in [PREFIXING].  The prefix rule is formulated in (6-4).


Like all PS rules in CPSG95, whenever two adjacent signs satisfy all the constraints, this rule takes effect in combining them into a higher level sign in parsing.  For example, the prefix 第 di- (-th) and the sign 五 wu (five) will be combined into the sign as shown in (6-5).


The combination of 第五 di+wu in (6-5) demonstrates how the morphological structure is built in the CPSG95 approach to Chinese prefixation.

6.3. Suffixation

Like prefixation, the Suffix PS Rule for suffixation is driven by the lexically encoded expectation in [SUFFIXING].  Parallel to the Prefix PS Rule, the suffix rule is formulated in (6-6).


With this PS rule in hand, all that is needed is to capture the individual derivational constraint in the lexical entries of the suffixes at issue.  For example, the suffix 性 –xing (-ness) changes an adjective or verb into an abstract noun:  A/V + ‑xing  ‑‑> N.  This information is contained in the formulation of the suffix 性 –xing (-ness) in the CPSG95 lexicon, as shown in (6-7).


Note that abstract nouns are uncountable, hence the call to the uncountable_noun macro to inherit the common property of uncountable nouns.[5]

Suppose the suffix 性 -xing (-ness) appears immediately after the adjective 实用 shi-yong (practical) formulated in (6-8), the suffix PS rule will combine them into a noun, as  shown in (6-9).


The combination of 实用性 shi-yong+xing in (6-9) demonstrates how the morphological structure is built in the CPSG95 approach to Chinese suffixation.

6.4. Quasi-affixes

The purpose of this section is to propose an adequate treatment of the quasi-affix phenomena in Chinese.  This is an area which has not received enough investigation in the field of Chinese NLP.  Few Chinese NLP systems demonstrate where and how to handle these quasi-affixes.

To achieve the purpose, typical examples of ‘quasi-affixes’ are presented and compared with some ‘genuine’ affixes.  The comparison highlights the general property shared by both ‘quasi-affixes’ and other affixes and also shows their differences.  Based on this study, it is found to be a feasible proposal to treat quasi-affixes within the derivation morphology of CPSG95.  The proposed solution will be presented by demonstrating how a typical quasi-affix is represented in CPSG95 and how the general affix rules can work with the lexical entries of ‘quasi-affixes’ as well.

The tables in (6-10) and (6-11) list some representative quasi-affixes in Chinese.

(6-10.)         Table for sample quasi-prefixes

prefixation examples
lei (quasi-)+N –> N 类前缀 lei-[qian-zhui]: quasi-[pre-fix]
前缀 qian (before, pre-, former-) zhui (…)
ban (semi-)+N –> N 半文盲 ban-[wen-mang]: semi-illiterate
文盲 wen (written-language), mang (blind)
dan (mono-)+N –> N 单音节 dan-[yin-jie]: mono-syllable
音节 yin (sound), jie (segment)
shuang (bi-)+N –> N 双音节 shuang-[yin-jie]: bi-syllable
duo (multi-)+N –> N 多音节 duo-[yin-jie]: multi-syllable
fei (non-)+N/A –> A 非谓 fei-wei: non-predicate
非正式 fei-[zheng-shi]: non-official
xiang (each other)+Vt (mono-syllabic) –> Vi 相爱 xiang-ai: love each other
zi (self-)+Vt –> Vi 自爱 zi-ai: self-love zi-xue-xi: self-learning
qian (former, ex-) + N
–> N
前夫人 qian-[fu-ren]: ex-wife
前总统 qian-[zong-tong]: former president

(6-11.)         Table for sample quasi-suffixes

suffixation Examples
N + shi (style) –> N 美国式 [mei-guo]-shi: American-style
NUM/N + xing (model)
–> N
1980型 1980-xing: 1980 model;
IV型 IV-xing: Model IV
A/V + (rate) –> N 准确率 [zhun-que]-lü: (percentage of) precision
NUM + liu (class) –> A 一流 yi-liu: first class
三流 san-liu: third class
N + mang (‘blind’, person who has little knowledge of) –> N 法盲 fa-mang:
person who has no knowledge of law
计算机盲 [ji-suan-ji]-mang: computer-layman

Compare the above quasi-affixes with the few widely acknowledged affixes like 性 -xing (-ness) and 第 di- (-th), it is fairly easy to observe that the property as generalized in Section 6.1 is shared by both affixes and quasi-affixes.  That is, in all cases of the combination, the affix or quasi-affix expects a sign of category x, with possible additional constraints, either on the right or on the left to form a derived word of category y (y may be equal to x).  For example, the quasi-prefix 自 zi- (self-) expects a transitive verb to produce an intransitive verb, etc.  This property supports the following two points of view:  (i) the affix or quasi-affix is the selecting head of the combination;  (ii) both types of combination (affixation) should be properly contained in morphology since the output is always a word (derivative).

In terms of difference, it is observed that there are different degrees of the functionalization of the meaning between quasi-affixes and other affixes.  For example, the nominalizer 性 -xing (‑ness) seems to be semantically more functionalized than the quasi-suffix 盲 -mang (blind-man, person who has little knowledge of).  In the case of 性 -xing (-ness), there is believed to be little semantic contribution from the affix.  But in cases of affixation by quasi-affixes, the semantic contribution of the affixes is non-trivial, and it must be ensured that proper semantics be built based on semantic compositionality of both the stem and the affix.

Except for the different degrees of semantic abstractness, there is no essential grammatical difference observed between quasi-affixes and the few widely accepted affixes.  As the semantic variation can be easily accommodated in the lexicon, nothing needs to be changed in the  general approach to Chinese derivation as described before.  The text below demonstrates how the quasi-affix phenomena are handled in CPSG95, using a sample quasi-affix to show the derivation.

The quasi-prefix to examine is 相 xiang- (each other).  It is used before a mono-syllabic transitive verb, making it an intransitive verb: 相 xiang- + Vt (monosyllabic) ‑‑> Vi.  More precisely, the syntactic object of the transitive verb is morphologically satisfied so that the derivative becomes an intransitive verb.

Unlike the original verb, the verb derived via xiang-prefixation requires a plural subject, as shown in (6-12).  This is a linguistically interesting phenomenon.  In a sense, it is a version of subject-predicate agreement in Chinese.

(6-12.) (a)    他们相爱过。
ta-men         xiang-         ai       guo
they            each-other   love    GUO
They used to love each other.

(b)      他爱过。
ta       ai       guo
he      love    GUO.
He used to love (someone).

(b) *   他相爱过。
ta       xiang-         ai       guo
he      each-other   love    GUO.

This number agreement can help decode the plural semantics of the subject noun as shown in the first sentence (6-13a) in the following group.  Sentence (6-13a) illustrates a common, number-underspecified case where the NP has no plural marker.  This contrasts with (6-13b) which includes a plural marker 们 men (-s), and with (6-13c) which resorts to the use of a numeral-classifier construction.

(6-13.) (a)     孩子相爱了。
hai-zi           xiang-         ai       le
child           each-other   love    LE
The children have fallen in love with each other.

(b)      孩子们相爱了。
hai-zi men   xiang-         ai       le
child  PLU   each-other   love    LE
The children have fallen in love with each other.

(c)      两个孩子相爱了。
liang ge      hai-zi           xiang-         ai       le
two    CLA   child           each-other   love    LE
The two children have fallen in love with each other.

Following the practice for number agreement in HPSG, the agreement can be captured by enforcing an additional plural constraint on the subject expectation [SUBJ | SIGN | CONTENT | INDEX | NUMBER plural], as shown in the formulation of the lexical entry for 相 xiang- (each other) in (6-14) below.


As shown above, the affixation also necessitates corresponding modification of the semantics in the argument structure:  the first argument is equal to the second via index [2].[6]  Note that the notation [ ], or more accurately, the most general feature structure, is used as a place holder.  For example, HANZI <[ ]> stands for the constraint of a mono-hanzi sign.  Another thing worth noticing is that the derivative requires that a subject must appear before it.  In other words, the subject expectation becomes obligatory.  This is based on the fact that this derived verb cannot stand by itself in syntax, unlike most original verbs in Chinese, say 爱 ai (love), whose subject expectation is optional.

With the lexical entries for the quasi-affixes taking care of the differences in the building of semantics, there is no need for any modification of the CPSG95 PS rules.  For example, the prefix 相 xiang- (each other) and the verb 爱 ai (love) formulated in (6-15) will be combined into the derivative 相爱 xiang-ai (love each other) shown in (6-16) via the Prefix PS Rule.


In summary, the proposed approach to Chinese derivation is effective in handling quasi-affixes as well.  The general grammar rules for derivation remain unchanged while lexical constraints are accommodated in the lexicon.  This demonstrates the advantages of the lexicalized design for grammar development.

6.5. Suffix 者 zhe (-er)

This section analyzes zhe-suffixation, a highly challenging  case at the interface between morphology and syntax.  This is believed to be an unsolvable problem as long as a system is based on the sequential processing of derivation morphology and syntax.  The solution to be proposed in this section is based on the argument that this suffixation is a combination of VP+zhe.

The suffix 者 zhe (-er, person) is a very productive bound morpheme.   It is often compared to the English suffix ‑er or ‑or, as seen in the pairs in (6-17).

工作 gong-zuo (work)      工作者 [gong-zuo]-zhe (work‑er)
劳动 lao-dong (labor)       劳动者 [lao-dong]-zhe (labor-er)
学习 xue-xi (learn)           学习者 [xue-xi]-zhe (learn-er);.

But 者 ‑zhe is not an ordinary suffix;  it belongs to the category of so-called ‘phrasal affix’,[7] with very different characteristics than the English counterpart.  Although the output of the zhe-suffixation is a word, the input is a VP, not a lexical V.  In other words, it combines with a VP and produces a lexical N:  VP+zhe –> N.   The arguments to be presented below support this analysis.

The first thing is to demonstrate the word status of zhe‑suffixation.  This is fairly straightforward:  there are no observed facts to show that the zhe-derivative is different from other lexical nouns in the syntactic distribution.  For example, like other lexical nouns, the derivative can combine with an optional classifier construction to form a noun phrase.   Compare the following pairs of examples in (6-18) and (6-19).

(6-18.) (a)    两名违反这项规定者
liang  ming [[wei-fan      zhe    xiang gui-ding]     -zhe]
two    CLA   violate         this    CLA   regulation   -er
two persons who have violated this regulation

(b)    两名学生
liang  ming xue-sheng
two    CLA   student
two students

(6-19.) (a)    他是一位优秀工作者
ta       shi     yi       wei    you-xiu        [[gong-zuo]   -zhe]
he      be      one    CLA   excellent      work           -er
He is an excellent worker.

(b)    他是一位优秀工人。
ta       shi     yi       wei    you-xiu        gong-ren
he      be      one    CLA   excellent      worker
He is an excellent worker.

The next thing is to demonstrate the phrasal nature of the ‘stem’.[8]   The stem is judged as a VP because it can be freely expanded by syntactical complements or modifiers without changing the morphological relationship between the stem and the suffix, as shown in (6‑20) below.  (6-20a) involves a modifier (努力 nu-li) before the head verb.  The verb stem in (6-20b) and (6-20c) is a transitive VP consisting of a verb and an NP object.

(6-20.) (a)    努力工作者
[nu-li  gong-zuo]     -zhe
hard  work           ‑er
hard-worker, person who works hard

(b)      学习鲁迅者
[xue-xi         Lu Xun]       -zhe
learn           Lu Xun       -er
person who is learning from Lu Xun

(c)      违反这项规定者
[wei-fan       zhe    xiang           gui-ding]      -zhe
violate         this    CLA   regulation   -er
person who violates this rule

More examples with the head verb 雇 gu (employ) are given in (6-21), with the last two expressions involving passivized VP.

(6-21.)(a)    雇者

(b)      雇人者
[gu               ren]             -zhe
employ        person         -er
those who employ people, employer/recruiter

(c)      被雇者
[bei gu]                  -zhe
[be-employed]       -er

(d)      被人雇者
[bei    ren              gu]               -zhe
by      person         employ        -er
those who are employed by (other) people

In fact, the stem VP is semantically equivalent to a relative clause.   A Chinese relative clause is normally expressed in the form of a DE-phrase: VP+de+N (Xue 1991).  In other words, 者 ‑zhe embodies functions of two signs, an N (‘person’, by default) and a relative clause introducer de, something like English one that + VP (or person who + VP).[9]  Compare the two examples in (6-22) and (6-23) with the same meaning – the expression in (6-23) is more colloquial than the first in (6-22) which uses the suffix 者‑zhe.

(6-22.) 违反规定者,处以罚款。
wei-fan        gui-ding       zhe,            chu-yi                   fa-kuan
violate         regulation   one that      punish-by   fine

Those who violate the regulations will be punished by fines.

(6-23.) 违反规定的人,处以罚款。
wei-fan        gui-ding       de      ren,             chu-yi          fa-kuan
violate         regulation   DE     person         punish-by   fine
Those who violate the regulations will be punished by fines.

On further examination, it is found that VPs with attached aspect markers combine with the suffix 者 -zhe with difficulty, as seen in the following examples.

(6-24.) (a)    违反规定者
wei-fan        gui-ding       zhe
violate         regulation   -er
Those who violate the regulations

(b) ?  违反了规定者
wei-fan        le       gui-ding       zhe
violate         LE     regulation   one that

This means that some further constraint may be necessary in order to prevent the grammar from producing strings like (6-24b).  If CPSG95 is only used for parsing, such a constraint is not absolutely necessary because, in normal Chinese text, such input is almost never seen.  Since CPSG95 is intended to be procedure-neutral, for use in both parsing and generation, the further constraint is desirable.

This constraint is in fact not an isolated phenomenon in Chinese grammar.  In syntax, the constraint is commonly required when the VP is not in the predicate position.[10]  For example, when a verb, say 喜欢 xi-huan (like), or a preposition, say 为了 wei-le (in order to), subcategorizes for a VP as a complement, it actually expects a VP with no aspect markers attached.   The following pair of sentences demonstrates this point.

(6-25.) (a)    我喜欢打篮球。
wo     xi-huan       da      lan-qiu.
I         like              play   basket-ball
I like playing basket-ball.

(b) * 我喜欢打了篮球。
wo     xi-huan       da      le       lan-qiu
I         like              play   LE     basket-ball

To accommodate such common constraint requirement in both Chinese morphology and syntax, a binary feature [FINITE] is designed for Chinese verbs in CPSG95.  In the lexicon, this feature is under-specified for each Chinese verb, i.e. [FINITE bin].  When an aspect marker 了着过 le/zhe/guo combines with the verb, this feature is unified to be [FINITE plus].  We can then enforce the required constraint [FINITE minus] in the morphological expectation or syntactic expectation to prevent aspected VP from appearing in a position expecting a non-predicate un-aspected  VP.

Based on the above analysis, the lexical entry of the suffix 者 –zhe is formulated in (6-26).  Note the notation for the macro with parameter (placed in parentheses) @common_noun(名|位|个).  This macro represents the following information.  The derivative is like any other common noun, it inherits the common property;  it can combine with an optional classifier construction using the classifier 名 ming or 位  wei or 个 ge.[11]


As seen, the VP expectation is realized by using the macro constraint @vp.  The semantics of the derivative is [np_semantics], an instance of -er with restriction from the event of VP, represented by [2].  The index [1] ensures that whatever is expected as a subject by the VP, which has no chances to be satisfied syntactically in this case, is semantically identical to this noun.[12]  In other words, this derived noun semantically fills an argument slot held by the subject in the VP semantics [v_content].  In the active case, say, 雇人者 [gu ren]–zhe (‘person who employs people’), the subject is the first argument, i.e. the index of this noun is the logical subject of employ.  However, when the VP is in passive, say, 被人雇者 [bei ren gu]‑zhe (‘person who is employed by other people’), the subject expected by the VP fills the second argument, i.e. the noun in this case is the logical object of the VP.  It is believed that this is the desired result for the semantic composition of zhe-derivation.

With the lexical expectation of the suffix as the basis, the general Suffix PS Rule is ready to work.  Remember that there is nothing restricting the input stem to the derivation in either of the derivation rules, formulated in (6-4) and (6-6) before.  In CPSG95, this is not considered part of the general grammar but rather a lexical property of the head affix.  It is up to the affix to decide what constraints such as category, wordhood status, semantic constraint, etc., to impose on the expected stem to produce a derivative.  In most cases of derivation, the input status of the stem is a word, but now we have an intricate case where the suffix zhe (-er) expects a verb phrase for derivation.  The general property for all cases of derivation is that regardless of the input, the output of derivation (as well as any other types of morphology) is always a word.

Before demonstrating by examples how zhe-derivation is implemented, there is a need to address the configurational constraints of CPSG95.  This is an important factor in realizing the flexible interaction between morphology and syntax as required in this case.

In all HPSG-style grammars, some type of configurational constraint is in place to ensure the proper order of rule application.  A typical constraint is that the subject rule should apply after the object rule.  This is implemented in CPSG95 by imposing the constraint in the subject PS rule that the head daughter must be a phrase and by imposing the constraint in the object PS rule that the subject of the head daughter may not be satisfied.[13]

Since derivation morphology and syntax are designed in the same framework in CPSG95, constraints are called for to ensure the ordering of rule application between morphological PS rules and syntactic PS rules as well.  In general, morphological rules apply before syntactic rules.  However, if this constraint is made absolute, to the extent that that all morphological rules must apply before all syntactic rules, we in effect make morphology and syntax two independent, successive modules, just like the case for traditional systems.  The grammar will then lose the power of flexible interaction between morphology and syntax and cannot handle cases like zhe-derivation.  However, this is not a problem in CPSG95.

The proposed constraint regulating the rule application order between morphological PS rules and syntactic PS rules is as follows.  Only when a sign has both obligatory morphological expectation and syntactic expectation will CPSG95 have constraints ensuring that the morphological rule apply first.  For example, as formulated in (6-14) before, the sign 相 xiang- (each other) has both morphological expectation in [PREFIXING] as a bound morpheme and syntactic expectation for the subject in [SUBJ] as (head of) derivative.  If the input string is 他们相爱  ta-men (they) xiang- (each other) ai (love), the prefix rule will first combine 相 xiang- (each other) and the stem 爱 ai (love) before the subject rule can apply.  The result is the expected structure embodying the results of both morphological analysis and syntactic analysis, [ta-men [xiang- ai]].  This constraint is implemented by specifying in all syntactic PS rules that the head daughter cannot have obligatory morphological expectation yet to be satisfied.  It effectively prevents a bound morpheme from being used as a constituent in syntax.   It should be emphasized that this constraint in the general grammar does not prohibit a bound morpheme from combining with any types of sign;  such constraints are only lexically decided in the expectation feature of the affix.

The following text shows step by step the CPSG95 solution to the problem of zhe-derivation.  The chosen example is the derivation for the derived noun 违法规定者 [[wei-fan gui-ding]-zhe]  ‘persons violating (the) regulation’.  The lexical sign of the suffix 者 -zhe (-er) has already been formulated in (6-26) before.  The words 违反 wei-fan (violate) and 规定 gui-ding (regulation) in the CPSG95 lexicon are shown in (6-27) and (6-28) respectively.


Note that all common nouns, specified as @common_noun, in the lexicon have the following INDEX features [PERSON 3, NUMBER number], i.e. third person with unspecified number.  As for the feature [GENDER], it is encoded in the noun itself with one of the following [male], [female], [have_gender], [no_gender] or unspecified as [gender].   The corresponding sort hierarchy is: [gender] consists of sub-sorts [no_gender] and [have_gender];  and [have_gender] is sub-typed into [male] and [female].  Of course, 规定 gui-ding (regulation) is lexically specified as [GENDER no_gender].

The following is the VP built by the object PS rule in the CPSG95 syntax.  As seen, the building of the semantics follows the practice in HPSG, with the argument slots filled by the [INDEX] feature of the subject and object.  In this VP case, [ARG2] has been realized.

The VP result in (6-29) and the suffix 者 –zhe will combine into the expected derived noun via the Suffix PS Rule, as shown in (6-30).


To summarize, it is the integrated model of derivational morphology and syntax in CPSG95 that makes the above analysis implementable.  Without the integration, there is no way that a suffix is allowed to expect a phrasal stem.[14]  The lexicalist approach adopted in CPSG95 facilitates the capturing of the individual feature of the phrase expectation for the few individual affixes like 者 –zhe. This enables the general PS rules for derivation in CPSG95 to be applicable to both typical cases of affixation and special cases of affixation.

6.6. Summary

This chapter has investigated some representative phenomena of Chinese derivation and their interface to syntax.  The solutions to these problems have been presented based on the arguments for the analysis.

The key to a lexicalized treatment of Chinese derivation is to determine the structural and semantic property of the derivative and to impose proper constraints on the expected stem.  The constraints on the expected stem are lexically specified in the corresponding morphological expectation feature structure of the affix.  The property of the derivative is also lexically encoded in the affix, seen as head of derivational structure in the CPSG95 analysis.  This property information will be percolated up when the derivation rules are applied.  These rules ensure that the output of derivation is a word.  It has been shown that this approach applies equally well to derivation via ‘quasi-affixes’ and the tough case of zhe-suffixation as well.



[1] Some linguists (e.g. Li and Thompson 1981) hold the view that Chinese has only a few affixes;  others (e.g. Chao 1968) believe that the inventory of Chinese affixes should be extended to include quasi-affixes.  Interestingly, the sign lei (quasi-, original sense ‘class’) itself is a quasi-prefix in Chinese.  Phenomena similar to Chinese quasi-affixes, called ‘semi-affixes’ or ‘Affixoide’, also exist in German morphology (Riehemann 1998).

[2] This is similar to the practice in many grammars, including HPSG, that a functional sign preposition is the selecting head of the corresponding syntactic structure, namely Prepositional Phrase.

[3] Those affixes which are not or no longer productive, e.g. lao‑ (original meaning ‘old’) in lao‑hu (tiger) and lao‑shu (mouse),  are not a problem.  The corresponding derived words are simply listed in the CPSG95 lexicon.

[4] The CPSG95 phrase-structural approach to Chinese productive derivation was inspired by the implementation in HPSG of a word-syntactic approach in Krieger (1994).  Similar practice is also seen in Selkirk (1982), Riehemann (1993) and Kathol (1999) in an effort to explore alternative approaches than the lexical rule approach to morphology.

[5] The major common property is reflected in two aspects, formulated in the macro definition of uncountable_noun in CPSG95.  First, there is value setting for the [NUMBER] feature, i.e. [CONTENT|INDEX|NUMBER no_number].  The CPSG95 sort hierarchy for the type [number] is defined as {a_number, no_number} where [a_number] is further sub-typed into {singular, plural}.  [NUMBER no_number] applies to uncountable nouns while [NUMBER a_number] is used for countable noun where the plurality is yet to be decided (i.e. under-specified for plurality).  Second, based on the syntactic difference between Chinese countable nouns and uncountable nouns, the classifier expected by uncountable nouns is exclusively zhong (kind/sort of).  That is, uncountable nouns may only combine with a preceding classifier construction using the classifier zhong.

[6] For time being, the subtle difference in semantics between pairs like We love ourselves and We love each other is not represented in the content.  It requires a more elaborate system of semantics to reflect the nuance.  The elaboration of semantics is left for future research.

[7] Some linguists (e.g. Z. Lu 1957; Lü et al 1980; Lü 1989; Dai 1993) have briefly introduced the notion of ‘phrasal affix’ in Chinese.  Lü further indicates that these ‘phrasal affixes’ are a distinctive characteristic of the Chinese grammar.

[8] The English possessive morpheme ‘s is arguably a suffix which expects an NP instead of a lexical noun as its stem:  NP + -’s.  Unlike VP + -zhe, the result of this NP + -‘s combination is generally regarded as a phrase, not a word.  In this sense, ‘s seems to be closer to a functional word, similar to a preposition or postposition, than to a suffix.

[9] Chinese zhe-suffixation is somewhat like the English phenomenon of what-clause (in ‘what he likes is not what interests her’). ‘What’ in this use also embodies functions of two signs that which. But the English what-clause functions as an NP, but VP+zhe forms a lexical N.

[10] It is generally agreed in the circle of Chinese grammar research that Chinese predicate (or finite) verbs have aspect distinction, using or not using aspect markers.  This is in contrast to English where both finite and non-finite verbs have aspect distinction but only finite verbs are tensed.

[11] It is generally agreed that each Chinese common noun may only combine with a classifier construction using a specific set of classifiers.  This classifier specification is generally regarded as lexical, idiosyncratic information of nouns (Lü et al 1980).  Using the macro with the classifier parameter follows this general idea.  It is worth noticing that the lexical formulation for -zhe (-er) in CPSG95 does not rely on any specific NP analysis chosen in syntax, except that the classifier specification should be placed under the entry for nouns (or derived nouns).

[12] The proposal in building the semantics for the zhe-derivative is based on ideas similar to the assumption adopted for the complement control in HPSG that ‘the fundamental mechanism of control was coindexing between the unexpressed subject of an unsaturated complement and its controler’ (Pollard and Sag 1994:282).

[13] If the object expectation is obligatory, this constraint ensures the priority of the object rule over the subject rule in application, building the desirable structure [S [V O]] instead of [[S V] O].  This is because, a verb with obligatory object yet to be satisfied is by definition not a phrase.  If the object expectation is optional, the order of rule application is still in effect although the lexical V in this scenario does not violate the phrase definition.  There are two cases for this situation.  In case one, the object O happens to occur in the input string.  The subject PS rule will tentatively combine S and V via the subject rule, but it can go no further.  This is because the object rule cannot apply after the subject rule, due to the constraint in the object rule that the head cannot have a satisfied subject.  The successful parse will only build the expected structure [S [V O]].  In case two, the object O does not appear in the input string.  Then the tentative combination [S V] built by the subject rule becomes the final parse.

[14] For example, if the lexical rule approach were adopted for derivation, this problem could not be solved.



PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

PhD Thesis: Chapter I Introduction

PhD Thesis: Chapter II Role of Grammar

PhD Thesis: Chapter III Design of CPSG95

PhD Thesis: Chapter IV Defining the Chinese Word

PhD Thesis: Chapter V Chinese Separable Verbs

PhD Thesis: Chapter VI Morpho-syntactic Interface Involving Derivation

PhD Thesis: Chapter VII Concluding Remarks

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP


PhD Thesis: Chapter V Chinese Separable Verbs


5.0. Introduction

This chapter investigates the phenomena usually referred to as separable verbs (离合动词 lihe dongci) in the form V+X.  Separable verbs constitute a significant portion of Chinese verb vocabulary.[1]  These idiomatic combinations seem to show dual status (Z. Lu 1957; L. Li 1990).  When V+X is not separated, it is like an ordinary verb.   When V is separated from X, it seems to be more like a phrasal combination.  The co-existence of both the separated use and contiguous use for these constructions is recognized as a long-standing problem at the interface of Chinese morphology and syntax (L. Wang 1955;  Z. Lu 1957; Chao 1968; Lü 1989; Lin 1983;  Q. Li 1983; L. Li 1990; Shi 1992; Dai 1993; Zhao and Zhang 1996).

Some linguists (e.g. L. Li 1990; Zhao and Zhang 1996) have made efforts to classify different types of separable verbs and demonstrated different linguistic facts about these types.  There are two major types of separable verbs:  V+N idioms with the verb-object relation and V+A/V idioms with the verb-modifier relation – when X is A or non-conjunctive V.[2]

The V+N idiom is a typical case which demonstrates the mismatch between a vocabulary word and grammar word.  There have been three different views on whether V+N idioms are words or phrases in Chinese grammar.

Given the fact that the V and the N can be separated in usage, the most popular view (e.g. Z. Lu 1957; L. Li 1990; Shi 1992) is that they are words when V+N are contiguous and they are phrases otherwise.  This analysis fails to account for the link between the separated use and the contiguous use of the idioms.  In terms of the type of V+N idioms like 洗澡 xi zao (wash-bath: take a bath), this analysis also fails to explain why a different structural analysis should be given to this type of contiguous V+N idioms listed in the lexicon than the analysis to the also contiguous but non-listable combination of V and N (e.g. 洗碗 xi wan ‘wash dishes’).[3]  As will be shown in Section 5.1, the structural distribution for this type of V+N idioms and the distribution for the corresponding non-listable combinations are identical.

Other grammarians argue that V+N idioms are not phrases (Lin 1983;  Q. Li 1983; Zhao and Zhang 1996).  They insist that they are words, or a special type of words.  This argument cannot explain the demonstrated variety of separated uses.

There are scholars (e.g. Lü 1989; Dai 1993) who indicate that idioms like 洗澡 xi zao are phrases.  Their judgment is based on their observation of the linguistic variations demonstrated by such idioms.  But they have not given detailed formal analyses which account for the difference between these V+N idioms and the non-listable V+NP constructions in the semantic compositionality.  That seems to be the major reason why this insightful argument has not convinced people with different views.

As for V+A/V idioms, Lü (1989) offers a theory that these idioms are words and the insertable signs between V and A/V are Chinese infixes.  This is an insightful hypothesis.  But as in the case of the analyses proposed for V+N idioms, no formal solutions have been proposed based on the analyses in the context of phrase structure grammars.  As a general goal, a good solution should not only be implementable, but also offer an analysis which captures the linguistic link, both structural and semantic, between the separated use and the contiguous use of separable verbs.  It is felt that there is still a distance between the proposed analyses reported in literature and achieving this goal of formally capturing the linguistic generality.

Three types of V+X idioms can be classified based on their different degrees of ‘separability’ between V and X, to be explored in three major sections of this chapter.  Section 5.1 studies the first type of V+N idioms like 洗澡 xi zao (wash-bath: take a bath).  These idioms are freely separable.  It is a relatively easy case.  Section 5.2 investigates the second type of the V+N idioms represented by 伤心 shang xin (hurt-heart: sad or heartbroken).  These idioms are less separable.  This category constitutes the largest part of the V+N phenomena.  It is a more difficult borderline case.  Section 5.3 studies the V+A/V idioms.  These idioms are least separable:  only the two modal signs 得 de3 (can) and 不 bu (cannot) can be inserted inside them, and nothing else.  For all these problems, arguments for the wordhood judgment will be presented first.  A corresponding morphological or syntactic analysis will be proposed, together with the formulation of the solution in CPSG95 based on the given analysis.

5.1. Verb-object Idioms: V+N I

The purpose of this section is to analyze the first type of V+N idioms, represented by 洗澡 xi zao (wash‑bath: take a bath).  The basic arguments to be presented are that they are verb phrases in Chinese syntax and the relationship between the V and the N is syntactic.  Based on these arguments, formal solutions to the problems involved in this construction will be presented.

The idioms like 洗澡 xi zao are classified as V+N I, to be distinguished from another type of idioms V+N II (see 5.2).  The following is a sample list of this type of idioms.

(5-1.) V+N I: xi zao type

洗澡 xi (wash) zao (bath #)              take a bath
擦澡 ca (scrub) zao (bath #)             clean one’s body by scrubbing
吃亏 chi (eat) kui (loss #)                   get the worst
走路 zou (go) lu (way $)                      walk
吃饭 chi (eat) fan (rice $)                    have a meal
睡觉 shui (V:sleep) jiao (N:sleep #)   sleep
做梦 zuo (make) meng (N:dream)     dream (a dream)
吵架  chao (quarrel) jia (N:fight #)    quarrel (or have a row)
打仗 da (beat) zhang (battle)              fight a battle
上当 shang (get) dang (cheating #)                be taken in
拆台 chai (pull down) tai (platform #)          pull away a prop
见面 jian (see) mian (face #)                            meet (face to face)
磕头 ke (knock) tou (head)                              kowtow
带头 dai (lead) tou (head $)                            take the lead
帮忙 bang (help) mang (business #)              give a hand
告状 gao (sue) zhuang (complaint #)            lodge a complaint

Note: Many nouns (marked with # or $) in this type of constructions cannot be used independently of the corresponding V.[4]  But those with the mark $ have no such restriction in their literal sense.  For example, when the sign fan  means ‘meal’, as it does in the idiom, it cannot be used in a context other than the idiom chi-fan (have a meal).  Only when it stands for the literal meaning ‘rice’, it does not have to co-occur with  chi.

There is ample evidence for the phrasal status of the combinations like 洗澡 xi zao.  The evidence is of three types.  The first comes from the free insertion of some syntactic constituent X between the idioms in the form V+X+N: this involves keyword-based judgment patterns and other X‑insertion tests proposed in Chapter IV.  The second type of evidence resorts to some syntactic processes for the transitive VP, namely passivization and long-distance topicalization.  The V+N I idioms can be topicalized and passivized in the same way as ordinary transitive VP structures do.  The last piece of evidence comes from the reduplication process associated with this type of idiom.   All the evidence leads to the conclusion that V+N I idioms are syntactic in nature.

The first evidence comes from using the wordhood judgment pattern: V(X)+zhe/guo à word(X).  It is a well observed syntactic fact that Chinese aspectual markers appear right after a lexical verb (and before the direct object).  If 洗澡 xi zao were a lexical verb, the aspectual markers would appear after the combinations, not inside them.  But that is not the case, shown by the ungrammaticality of the example in (5-2b).  A productive transitive VP example is given in (5-3) to show its syntactic similarity (parallelness) with V+N I idioms.

(5-2.) (a)      他正在洗着澡
ta       zheng-zai    xi      zhe    zao.
he      right-now    wash ZHE   bath
He is taking a bath right now.

(b) *   他正在洗澡着。
ta       zheng-zai    xizao         zhe.
he      right-now    wash-bath   ZHE

(5-3.) (a)      他正在洗着衣服。
ta       zheng-zai    xi      zhe    yi-fu.
he      right-now    wash ZHE   clothes
He is washing the clothes right now.

(b) *   他正在洗衣服着。
ta       zheng-zai    xi      yi-fu           zhe.
he      right-now    wash clothes        ZHE

The above examples show that the aspectual marker 着 zhe (ZHE) should be inserted in the V+N idiom, just as it does in an ordinary transitive VP structure.

Further evidence for X-insertion is given below.   This comes from the post-verbal modifier of ‘action-times’ (动量补语 dongliang buyu) like ‘once’, ‘twice’, etc.  In Chinese, action-times modifiers appear after the lexical verb and aspectual marker (but before the object), as shown in (5-4a) and (5-5a).

(5-4.) (a)      他洗了两次澡。
ta       xi      le       liang  ci       zao.
he      wash LE     two    time   bath
He has taken a bath twice.

(b) *   他洗澡了两次。
ta       xizao         le       liang  ci.
he      wash-bath   LE     two    time

(5-5.) (a)      他洗了两次衣服。
ta       xi      le       liang  ci       yi-fu.
he      wash LE     two    time   clothes
He has washed the clothes twice.

(b) *   他洗衣服了两次。
ta       xi      yi-fu           le       liang  ci.
he      wash clothes        LE     two    time

So far, evidence has been provided of syntactic constituents which are attached to the verb in the V+N I idioms.  To further argue for the VP status of the whole idiom, it will be demonstrated that the N in the V+N I idioms in fact fills the syntactic NP position in the same way as all other objects do in Chinese transitive VP structures.  In fact, N in the V+N I does not have to be a bare N:  it can be legitimately expanded to a full-fledged NP (although it does not normally do so).  A full-fledged NP in Chinese typically consists of a classifier phrase (and modifiers like de-construction) before the noun.  Compare the following pair of examples.  Just like an ordinary NP 一件崭新的衣服 yi jian zan-xin de yi-fu (one piece of brand-new clothes), 一个痛快的澡 yi ge tong-kuai de zao (a comfortable bath) is a full-fledged NP.

(5-6.)           他洗了一个痛快的澡。
ta       xi      le       yi       ge      tong-kuai     de      zao.
he      wash LE     one    CLA   comfortable DE     bath
He has taken a comfortable bath.

(5-7.)           他洗了一件崭新的衣服。
ta       xi      le       yi       jian    zan-xin        de      yi-fu.
he      wash LE     one    CLA   brand-new  DE     clothes
He has washed one piece of brand-new clothes.

It requires attention that the above evidence is directly against the following widespread view, i.e. signs like 澡 zao, marked with # in (5-1), are ‘bound morphemes’ or ‘bound stems’ (e.g. L. Li 1990; Zhao and Zhang 1996).  As shown, like every other free morpheme noun (e.g. yi-fu), zao holds a lexical position in the typical Chinese NP sequence ‘determiner + classifier + (de-construction) + N’, e.g. 一个澡 yi ge zao (a bath), 一个痛快的澡 yi ge tong-kuai de zao (a comfortable bath).[5]  In fact, as long as the ‘V+N I phrase’ arguments are accepted (further evidence to come), by definition ‘bound morpheme’ is a misnomer for 澡 zao.  As a part of morphology, a bound morpheme cannot play a syntactic role:  it is inside a word and cannot be seen in syntax.  The analysis of 洗xi (…) zao as a phrase entails the syntactic roles played by 澡 zao:  (i) 澡 zao is a free morpheme noun which fills the lexical position as the final N inside the possibly full-fledged NP;  (ii) 澡zao plays the object role in the syntactic transitive structure 洗澡xi zao.

This bound morpheme view is an argument used for demonstrating  the relevant V+N idioms to be words rather than phrases (e.g. L. Li 1990).  Further examination of this widely accepted view will help to strengthen the counter-arguments that all V+N I idioms are phrases.

Labeling signs like 澡zao (bath) as bound morphemes seem to come from an inappropriate interpretation of the statement that bound morphemes cannot be ‘freely’, or ‘independently’, used in syntax.[6]  This interpretation places an equal sign between the idiomatic co-occurrence constraint and ‘not being freely used’.  It is true that 澡zao is not an ordinary noun to be used in isolation.  There is a co-occurrence constraint in effect:  澡zao cannot be used without the appearance of 洗xi (or 擦ca).  However, the syntactic role played by 澡zao, the object in the syntactic VP structure, has full potential of being ‘freely’ used as any other Chinese NP object:   it can even be placed before the verb in long-distance constructions as shall be shown shortly.  A more proper interpretation of ‘not being freely used’ in terms of defining bound morphemes should be that a genuine bound morpheme, e.g. the suffix 性 -xing ‘-ness’, has to attach to another sign contiguously to form a word.

A comparison with similar phenomena in English may be helpful.  English also has similar idiomatic VPs, such as kick the bucket.[7]  For the same reason, it cannot be concluded that bucket (or the bucket) is a bound morpheme only because it demonstrates necessary co-occurrence with the verb literal kick.  Signs like bucket, zao (bath) are not of the same nature as bound morphemes like –less, -ly, un-, ‑xing (-ness), etc

The second type of evidence shows some pattern variations for the V+N I idioms.  These variations are typical syntactic patterns for the transitive V+NP structure in Chinese.  One of most frequently used patterns for transitive structures is the topical pattern of long distance dependency.  This provides strong evidence for judging the V+N I idioms as syntactic rather than morphological.  For, with the exception of clitics, morphological theories in general conceive of the parts of a word as being contiguous.[8]  Both the V+N I idiom and the normal V+NP structure can be topicalized, as shown in (5-8b) and (5-9b) below.

(5-8.) (a)      我认为他应该洗澡。
wo     ren-wei        ta       ying-gai       xi zao.
I         think           he      should        wash-bath
I think that he should take a bath.

(b)      澡我认为他应该洗
zao    wo     ren-wei        ta       ying-gai       xi.
bath  I         think           he      should        wash
The bath I think that he should take.

(5-9.) (a)       我认为他应该洗衣服。
wo     ren-wei        ta       ying-gai       xi      yi-fu.
I         think           he      should        wash clothes
I think that he should wash the clothes.

(b)      衣服我认为他应该洗。
yi-fu           wo     ren-wei        ta       ying-gai       xi.
clothes        I         think           he      should        wash
The clothes I think that he should wash.

The minimal pair of passive sentences in (5-10) and (5‑11) further demonstrates the syntactic nature of the V+N I structure.

(5-10.)         澡洗得很干净。
zao             xi      de3    hen    gan-jing.
bath            wash DE3   very   clean
A good bath was taken so that one was very clean.

(5-11.)         衣服洗得很干净。
yi-fu           xi      de3    hen    gan-jing.
clothes        wash DE3   very   clean
The clothes were washed clean.

The third type of evidence involves the nature of reduplication associated with such idioms.  For idioms like 洗澡 xi zao (take a bath), the first sign can be reduplicated to denote the shortness of the action:  洗澡 xi zao (take a bath) –> 洗洗澡 xi xi zao (take a short bath).  If 洗澡 xi zao is a word, by definition, 洗xi is a morpheme inside the word and 洗洗澡 xi-xi-zao belongs to morphological reduplication (AB–>AAB type).  However, this analysis fails to account for the generality of such reduplication:  it is a general rule in Chinese grammar that a verb reduplicates itself contiguously to denote the shortness of the action.  For example, 听音乐 ting (listen to) yin-yue (music) –> 听听音乐 ting ting yin-yue (listen to music for a while); 休息 xiu-xi (rest) –> 休息休息 xiu-xi xiu-xi (have a short rest), etc.  On the other hand, when we accept that 洗澡 xi zao is a verb-object phrase in syntax and the nature of this reduplication is accordingly judged as syntactic,[9] we come to a satisfactory and unified account for all the related data.  As a result, only one reduplication rule is required in CPSG95 to capture the general phenomena;[10]  there is no need to do anything special for V+N  idioms.

This AB ‑‑> AAB type reduplication problem for the V+N idioms poses a big challenge to traditional word segmenters (Sun and Huang 1996).  Moreover, even when a word segmenter successfully incorporates some procedure to cope with this problem, the essentially same rule has to be repeated in the grammar for the general VV reduplication.  This is not desirable in terms of capturing the linguistic generality.

All the evidence presented above indicates that idioms like 洗澡xi zao, no matter whether V and N are used contiguously or not, are not words, but phrases.  The idiomatic nature of such combinations seems to be the reason why most native speakers, including some linguists, regard them as words.  Lü (1989: 113-114) suggests that vocabulary words  like 洗澡 xi zao should be distinguished from grammar words.  He was one of the first Chinese grammarians who found that the V+N relation in the idioms like 洗澡 xi zao is a syntactic verb object relation.  But he did not provide full arguments for his view, neither did he offer a precise formalized analysis of this problem.[11]

As shown in the previous examples, the V+N I idioms do not differ from other transitive verb phrases in all major syntactic behaviors.   However, due to their idiomatic nature, the V+N I idioms are different from ordinary transitive VPs in the following two major aspects.  These differences need to be kept in mind when formulating the grammar to capture the phenomena.

  • Semantics:  the semantics of the idiom should be given directly in the lexicon, not as a result of the computation of the semantics of the parts based on some general principle of compositionality.
  • Co-occurrence requirement:  洗 xi (or 擦 ca) and 澡 zao must co-occur with each other;  走 zou (go) and 路 lu (way) must co-occur; etc.  This is a requirement specific to the idioms at issue.  For example, 洗 xi and 澡 zao must co-occur in order to stand as an idiom to mean ‘take a bath’.

Based on the study above, the CPSG95 solution to this problem is described below.  In order to enforce the co-occurrence of the V+N I idioms, it is specified in the CPSG95 lexicon that the head V obligatorily expects as its object an NP headed by a specific literal.  This treatment originates from the practice of handling collocations in HPSG.  In HPSG, there are features designed to enable the subcategorization for particular words, or phrases headed by particular words.  For example, the feature [NFORM there] and [NFORM it] refer to the expletive there and it respectively for the special treatment of existential constructions, cleft constructions, etc. (Pollard and Sag 1987:62).  The values of the feature PFORM distinguish individual prepositions like for, on, etc.  They are used in phrasal verbs like rely on NP, look for NP, etc.  In CPSG95, this approach is being generalized, as described below.

As presented before, the feature for orthography [HANZI] records the Chinese character string for each lexical sign.  When a specific lexical literal is required in an idiomatic expectation, the constraint is directly placed on the value of the feature [HANZI] of the expected sign, in addition to possible other constraints.  It is standard practice in a lexicalized grammar that the expected complement (object) for the transitive structure be coded directly in the entry of the head V in the lexicon.  Usually, the expected sign is just an ordinary NP.  In the idiomatic VP like 洗 xi (…) 澡 zao, one further constraint is placed:  the expected NP must be headed by the literal character 澡zao.  This treatment ensures that all pattern variations for transitive VP such as passive constructions, topicalized constructions, etc. in Chinese syntax will equally apply to the V+N I idioms.[12]

The difference in semantics is accommodated in the feature [CONTENT] of the head V with proper co-indexing.  In ordinary cases like 洗衣服 xi yi-fu (wash clothes), the argument structure is [vt_semantics] which requires two arguments, with the role [ARG2] filled by the semantics of the object NP.  In the idiomatic case 洗澡 xi zao (take a bath), the V and N form a semantic whole, coded as [RELN take_bath].[13]  The V+N I idioms are formulated like intransitive verbs in terms of composing the semantics – hence coded as [vi_semantics], with only one argument to be co-indexed with the subject NP.  Note that there are two lexical entries in the lexicon for the verb 洗 xi (wash), one for the ordinary use and the other for the idiom, shown in (5-12) and (5-13).


The above solution takes care of the syntactic similarity of the
V+N I idioms and ordinary V+NP structures.  It is also detailed enough to address their major differences.  In addition, the associated reduplication process (i.e. V+N –> V+V+N) is no longer a problem once this solution is adopted.  As the V in the V+N idioms is judged and coded as a lexical V (word) in this proposal, the reduplication rule which handles V –> VV will equally apply here.

5.2. Verb-object Idioms: V+N II

The purpose of this section is to provide an analysis of another type of V+N idiom and present the solution implemented in CPSG95 based on the analysis.

Examples like 洗澡 xi zao (take a bath) are in fact easy cases to judge.   There are more marginal cases.  When discussing Chinese verb-object idioms, L. Li (1990) and Shi (1992) indicate that the boundary between a word and a phrase in Chinese is far from clear-cut.  There is a remarkable “gray area” in between.  Examples in (5-14) are V+N II idioms, in contrast to the V+N I type, classified by L. Li (1990).

(5-14.) V+N II: 伤心 shang xin type

伤心 shang (hurt) xin (heart)             sad or break one’s heart
担心 dan (carry) xin (heart)               worry
留神 liu (pay) shen (attention)           pay attention to
冒险 mao (take) xian (risk)                 take the risk
借光 jie (borrow) guang (light)           benefit from
劳驾 lao (bother) jia (vehicle)             beg the pardon
革命 ge (change) ming (life)                 make revolution
落后 luo (lag) hou (back)                      lag behind
放手 fang (release) shou (hand)          release one’s hold

Compared with V+N I (洗澡xi zao type), V+N II has more characteristics of a word.  The lists below given by L. Li (1990) contrast their respective characteristics.[14]

(5-15.) V+N I (based on L. Li 1990:115-116)

as a word


(a1) corresponds to one generalized sense (concept)

(a2) usually contains ‘bound morpheme(s)’

as a phrase



(b1) may insert an aspectual particle (X=le/zhe/guo)

(b2) may insert all types of post-verbal modifiers (X=BUYU)

(b3) may insert a pre-nominal modifier de-construction (X=DEP)

(5-16.) V+N II (based on L. Li 1990:115)

as a word



(a1) corresponds to one generalized sense (concept)

(a2) usually contains ‘bound morpheme(s)’

(a3) (some) may be followed by an aspectual particle (X=le/zhe/guo)

(a4) (some) may be followed by a post-verbal modifier
of duration or number of times (X=BUYU)

(a5) (some) may take an object (X=BINYU)

as a phrase



(b1) may insert an aspectual particle (X=le/zhe/guo)

(b2) may insert all types of post-verbal modifiers (X=BUYU)

(b3) may insert a pre-nominal modifier de-construction (X=DEP)

For V+N I, the previous text has already given detailed analysis and evidence and decided that such idioms are phrases, not words.  This position is not affected by the demonstrated features (a1) and (a2) in (5‑15);  as argued before,  (a1) and (a2) do not contribute to the definition of a grammar word.

However, (a3), (a4) and (a5) are all syntactic evidence showing that V+N II idioms can be inserted in lexical positions.   On the other hand, these idioms also show the similarity with V+N I idioms in the features (b1), (b2) and (b3) as a phrase.  In particular, (a3) versus (b1) and (a4) versus (b2) demonstrate a ‘minimal pair’ of phrase features and word features.  The following is such a minimal pair example (with the same meaning as well) based on the feature pairs (a3) versus (b1), with a post-verbal modifier 透tou (thorough) and aspectual particle 了le (LE).  It demonstrates the borderline status of such idioms.  As before, a similar example of an ordinary transitive VP is also given below for comparison.

(5-17.)         V+N II: word or phrase?

伤心:sad; heart-broken
shang          xin
hurt            heart

(a)      我伤心透了
wo     shang-xin  tou              le.
I         sad              thorough     LE
I was extremely sad.

(b)      我伤透了心
wo     shang         tou              le       xin.
I         break          thorough     LE     heart
I was extremely sad.

(5-18.)         Ordinary V+NP phrase: 恨hen (hate) 他ta (he)

(a) *   我恨他透了
wo     hen   ta      tou              le.
I         hate   he      thorough     LE

(b)      我恨透了他
wo     hen   tou              le       ta.
I         hate   thorough     LE     he
I thoroughly hate him.

As shown in (5-18), in the common V+NP structure, the post-verbal modifier 透 tou (thorough) and the aspectual particle 了 le (perfect aspect) can only occur between the lexical V and NP.  But in many V+N II idioms, they may occur either after the V+N combination or in between.  In (5‑17a), 伤心 shang xin is in the lexical position because Chinese syntax requires that the post-verbal modifier attach to the lexical V, not to a VP as indicated in (5-18a).  Following the same argument, 伤 shang (hurt) alone in (5-17b) must be a lexical V as well.  The sign 心 xin (heart) in (5‑17b) establishes itself in syntax as object of the V, playing the same role as 他ta (he) in (5-18b).  These facts show clearly that V+N II idioms can be used both as lexical verbs and as transitive verb phrases.   In other words, before entering a context, while still in the lexicon, one can not rule out either possibility.

However, there is a clear cut condition for distinguishing its use as a word and its use as a phrase once a V+N II idiom is placed in a context.   It is observed that the only time a V+N II idiom assumes the lexical status is when V and N are contiguous.  In all other cases, i.e. when V and N are not contiguous, they behave essentially similar to the V+N I type.

In addition to the examples in (5-17) above, two more examples are given below to demonstrate the separated phrasal use of V+N II.  The first is the case V+X+N where X is a possessive modifier attached to the head N.  Note also the post-verbal position of 透 tou (thorough) and 了le (LE).  The second is an example of passivization when N occurs before V.  These examples provide strong evidence for the syntactic nature of V+N II idioms when V and N are not used contiguously.

(5-19.) (a) *   你伤他的心透了
ni       shang         ta       de      xin    tou              le.
you    hurt            she    DE     heart thorough     LE

(b)      你伤透了他的心
ni       shang         tou              le       ta       de      xin.
you    hurt            thorough     LE     she    DE     heart
You broke her heart.

(5-20.)         V+N II: instance of passive with or without 被 bei (BEI)

xin    (bei)   shang         tou              le.
heart BEI    break          thorough     LE
The heart was completely broken.
or: (Someone) was extremely sad.

Based on the above investigation, it is proposed in CPSG95 that two distinct entries be constructed for each such idiom, one as an inseparable lexical V, and the other as a transitive VP just like that of V+N I.  Each entry covers its own part of the phenomena.  In order to capture the semantic link between the two entries, a lexical rule called V_N_II Rule is formulated in CPSG95, shown in (5-21).


The input to the V_N_II Lexical Rule is an entry with [CATEGORY v_n_ii] where [v_n_ii] is a given sub-category in the lexicon for V+N II type verbs.  The output is another entry with the same information except for three features [HANZI], [CATEGROY] and [COMP1_RIGHT].  The new value for [HANZI] is a list concatenating the old [HANZI] and the [HANZI] for the expected [COMP1_RIGHT].  The new [CATEGORY] value is simply [v].  The value for [COMP1_RIGHT] becomes [null].  The outline of the two entries captured by this lexical rule are shown in (5-22) and (5-23).


It needs to be pointed out that the definition of [CATEGORY v_n_ii] in CPSG95 is narrower than L. Li’s definition of V+N II type idioms.  As indicated by L. Li (1990), not all V+N II idioms share the same set of lexical features (a3), (a4) and (a5) as a word.  The definition in CPSG95 does not include the idioms which share the lexical feature (a5), i.e. taking a syntactic object.  These are idioms like 担心danxin (carry-heart: worry about).  For such idioms, when they are used as inseparable compound words, they can take a syntactic object.  This is not possible for all other V+N idioms, as shown below.

(5-24.) (a)     她很担心你
ta       hen    dan-xin                ni.
he      very   worry (about)        you
He is very concerned about you.

(b) *   他很伤心你
ta       hen    shang-xin            ni.
he      very   sad                       you

In addition, these idioms do not demonstrate the full distributional potential of transitive VP constructions.  The separated uses of these idioms are far more limited than other V+N idioms.  For example, they can hardly be passivized or topicalized as other V+N idioms can, as shown by the following minimal pair of passive constructions.

(5-25.)(a) *   心(被)担透了
xin    (bei)   dan             tou              le.
heart BEI    carry           thorough     LE

(b)      心(被)伤透了
xin    (bei)   shang         tou              le.
heart BEI    break          thorough     LE
The heart was completely broken.
or: (Someone) was extremely sad.

In fact, the separated use (‘phrasal use’) for such V+N idioms seems only limited to some type of X-insertion, typically the appearance of aspect signs between V and N.[15]  Such separated use is the only thing shared by all V+N idioms, as shown below.

(5-26.)(a)     他担过心
ta       dan             guo    xin
he      carry           GUO  heart
He (once) was worried.

(b)      他伤过心
ta       shang         guo    xin
he      break          GUO  heart
He (once) was heart-broken.

To summarize,  the V+N idioms like 担心 dan-xin which can take a syntactic object do not share sufficient generality with other V+N II idioms for a lexical rule to capture.  Therefore, such idioms are excluded from the [CATEGORY v_n_ii] type.  This makes these idioms not subject to the lexical rule proposed above.  It is left for future research to answer the question whether there is enough generality among this set of idioms to justify some general approach to this problem, say, another lexical rule or some other ways of generalization of the phenomena.  For time being, CPSG95 simply lists both the contiguous and separated uses of these idioms in the lexicon.[16]

It is worth noticing that leaving such idioms aside, this lexical rule still covers large parts of V+N II phenomena.  The idioms like 担心dan-xin only form a very small set which are in the state of transition to words per se (from the angle of language development) but which still retain some (but not complete) characteristics of a phrase.[17]

5.3. Verb-modifier Idioms: V+A/V

This section investigates the V+X idioms in the form of V+A/V.  The data for the interaction of V+A/V idioms and the modal insertion are presented first.  The subsequent text will argue for Lü’s infix hypothesis for the modal insertion and accordingly propose a lexical rule to capture the idioms with or without modal insertion.

The following is a sample list of V+A/V idioms, represented by kan jian (look-see: have seen).

(5-27.) V+A/V: kan jian type

看见 kan (look) jian (see)                    have seen
看穿 kan (look) chuan (through)        see through
离开 li (leave) kai (off)                         leave
打倒 da (beat) dao (fall)                      down with
打败 da (beat) bai (fail)                       defeat
打赢 da (beat) ying (win)                    fight and win
睡着 shui (sleep) zhao (asleep)            fall asleep
进来 jin (enter) lai (come)                             enter
走开 zou (go) kai (off)                         go away
关上  guan (close) shang (up)             close

In the V+A/V idiom kan jian (have-seen), the first sign kan (look) is the head of the combination while the second jian (see) denotes the result.  So when we say, wo (I) kan-jian (see) ta (he), even without the aspectual marker le (LE) or guo (GUO), we know that it is a completed action:  ‘I have seen him’ or ‘I saw him’.[18]

Idioms like kan-jian (have-seen) function just as a lexical whole (transitive verb).  When there is an aspect marker, it is attached immediately after the idioms as shown in (5‑28).  This is strong evidence for judging V+A/V idioms as words, not as syntactic constructions.

(5-28.)         我看见了他
wo     kan jian     le       ta.
I         look-see       LE     he                   I have seen him.

The only observed separated use is that such idioms allow for two modal signs 得 de3 (can) and 不 bu (cannot) in between, shown by (5-29a) and (5-29b).  But no other signs, operations or processes can enter the internal structure of these idioms.

(5-29.) (a)     我看不见他
wo     kan bu jian         ta.
I         look cannot see     he
I cannot see him.

(c)      你看得见他吗?
ni       kan de3 jian       ta       me?
you    look can see          he      ME
Can you see him?

Note that English modal verbs ‘can’ and ‘cannot’ are used to translate these two modal signs.  In fact, Contemporary Mandarin also has corresponding modal verbs (能愿动词 neng-yuan dong-ci):  能 neng (can) and 不能 bu neng (cannot).  The major difference between Chinese modal verbs 能 neng / 不能 bu neng and the modal signs 得 de3 / 不 bu lies in their different distribution in syntax.  The use of modal signs 得 de3 (can) and 不 bu (cannot) is extremely restrictive:  they have to be inserted into V+BUYU combinations.  But Chinese modal verbs can be used before any VP structures.  It is interesting to see the cases when they are used together in one sentence, as shown in (5-30 a+b) below.  Note that the meaning difference between the two types of modal signs is subtle, as shown in the examples.

(5-30.)(a)     你看得见他吗?
ni       kan de3 jian         ta       me?
you    look can see          he      ME
Can you see him? (Is your eye-sight good enough?)

(b)      你能看见他吗?
ni       neng kan jian      ta       me?
you    can    see              he      ME
Can you see him?
(Note: This is used in more general sense. It covers (a) and more.)

(a+b)  你能看得见他吗?
ni       neng kan de3 jian         ta       me?
you    can    look can see          he      ME
Can you see him? (Is your eye-sight good enough?)

(5-31.)(a)     我看不见他
wo     kan bu jian           ta
I         look cannot see     he
I cannot see him. (My eye-sight is too poor.)

(b)      我不能看见他
wo     bu     neng kan jian      ta
I         not    can    see              he
I cannot see him. (Otherwise, I will go crazy.)

(a+b) 我不能看不见他
wo     bu     neng kan bu jian           ta.
I         not    can    look cannot see     he
I cannot stand not being able to see him.
(I have to keep him always within the reach of my sight.)

Lü (1989:127) indicates that the modal signs are in fact the only two infixes in Contemporary Chinese.  Following this infix hypothesis, there is a good account for all the data above.  In other words, the V+A/V idioms are V+BUYU compound words subject to the modal infixation.  The phenomena of 看得见 kan-de3-jian (can see) and 看不见 kan-bu-jian (cannot see) are therefore morphological by nature.  But Lü did not offer formal analysis for these idioms.

Thompson (1973) first proposed a lexical rule to derive the potential forms V+de3/bu+A/V from the V+A/V idioms.  The lexical rule approach seems to be most suitable for capturing the regularity of the V+A/V idioms and their infixation variants V+de3/bu+A/V.  The  approach taken in CPSG95 is similar to Thompson’s proposal.  More precisely, two lexical rules are formulated in CPSG95 to handle the infixation in V+A/V idioms.  This way, CPSG95 simply lists all V+A/V idioms in the lexicon as V+A/V type compound words, coded as [CATEGORY v_buyu].[19]  Such entries cover all the contiguous uses of the idioms.  It is up to the two lexical rules to produce two infixed entries to cover the separated uses of the idioms.

The change of the infixed entries from the original entry lies in the semantic contribution of the modal signs.  This is captured in the lexical rules in (5-32) and (5-33).  In case of V+de3+A/V, the Modal Infixation Lexical Rule I in (5-32) assigns the value [can] to the feature [MODAL] in the semantics.  As for V+bu+A/V, there is a setting  [POLARITY minus] used to represent the negation in the semantics, shown in (5-33).[20]


The following lexical entry shows the idiomatic compound 看见 kan-jian as coded in the CPSG95 lexicon (leaving some irrelevant details aside).   This entry satisfies the necessary condition for the proposed infixation lexical rules.


The modal infixation lexical rules will take this [v_buyu] type compound as input and produce two V+MODAL+BUYU entries.  As a result, new entries 看得见 kan-de3-jian (can see) and 看不见 kan-bu-jian (cannot see) as shown below are added to the lexicon.[21]



The above proposal offers a simple, effective way of capturing the linguistic data of the interaction of V+A/V idioms and the modal insertion, since it eliminates the need for any change of the general grammar in order to accommodate this type of separable verbs interacting with 得 de3 / 不 bu, the only two infixes in Chinese.

5.4. Summary

This chapter has conducted an inquiry into the linguistic phenomena of Chinese separable verbs, a long-standing difficult problem at the interface of Chinese compounding and syntax.   For each type of separable verb, arguments for the wordhood judgment have been presented.  Based on this judgment, CPSG95 provides analyses which capture both structural and semantic aspects of the constructions at issue.  The proposed solutions are formal and implementable.  All the solutions provide a way of capturing the link between the separated use and contiguous use of the V+X idioms.  The proposals presented in this chapter cover the vast majority of separable verbs.  Some unsolved rare cases or potential problems are also identified for further research.



[1] They are also called phrasal verbs (duanyu dongci) or compound verbs (fuhe dongci) among Chinese grammarians.  For linguists who believe that they are compounds, the V+N separable verbs are often called verb object compounds and the V+A/V separable verbs resultative compounds.  The want of a uniform term for such phenomena reflects the borderline nature of these cases.  According to Zhao and Zhang (1996), out of the 3590 entries in the frequently used verb vocabulary, there are 355 separable V+N idioms.

[2] As the term ‘separable verbs’ gives people an impression that these verbs are words (which is not necessarily true), they are better called V+X (or V+N or V+A/V) idioms.

[3] There is no disagreement among Chinese grammarians for the verb-object combinations like xi wan:  they are analyzed as transitive verb phrases in all analyses, no matter whether the head V and the N is contiguous (e.g. xi wan ‘wash dishes’) or not (e.g. xi san ge wan ‘wash three dishes’).

[4] Such signs as zao (bath), which are marked with # in (5-1), are often labeled as ‘bound morphemes’ among Chinese grammarians, appearing only in idiomatic combinations like xi zao (take a bath), ca zao (clean one’s body by scrubbing).  As will be shown shortly, bound morpheme is an inappropriate classification for these signs.

[5] It is widely acknowledged that the sequence num+classifier+noun is one typical form of Chinese NP in syntax.  The argument that zao is not a bound morpheme does not rely on any particular analysis of such Chinese NPs.  The fact that such a combination is generally regarded as syntactic ensures the validity of this argument.

[6] The notion ‘free’ or ‘freely’ is linked to the generally accepted view of regarding word as a minimal ‘free’ form, which can be traced back to classical linguistics works such as Bloomfield (1933).

[7] It is generally agreed that idioms like kick the bucket are not compounds but phrases (Zwicky 1989).

[8] That is the rationale behind the proposal of inseparability as important criterion for wordhood judgment in Lü (1989).

[9] In Chinese, reduplication is a general mechanism used both in morphology and syntax.  This thesis only addresses certain reduplication issues when they are linked to the morpho-syntactic problems under examination, but cannot elaborate on the Chinese reduplication phenomena in general.  The topic of Chinese reduplication deserves the study of a full-length dissertation.     

[10] In the ALE implementation of CPSG95, there is a VV Diminutive Reduplication Lexical Rule in place for phenomena like xi zao (take a bath) à xi xi zao (take a short bath);  ting yin-yue (listen to music) à ting ting yin-yue (listen to music for a while);  xiu-xi (rest) à xiu-xi xiu-xi (have a short rest).

[11] He observes that there are two distinct principles on wordhood.  The vocabulary principle requires that a word represent an integrated concept, not the simple composition of its parts.  Associated with the above is a tendency to regard as a word a relatively short string.  The grammatical principle, however, emphasizes the inseparability of the internal parts of a combination.  Based on the grammatical principle, xi zao is not a word, but a phrase.  This view is very insightful.

[12] The pattern variations are captured in CPSG95 by lexical rules following the HPSG tradition.  It is out of the scope of this thesis to present these rules in the CPSG95 syntax.  See W. Li (1996) for details.

[13] In the rare cases when the noun zao is realized in a full-fledged phrase like yi ge tong-kuai de zao (a comfortable bath), we may need some complicated special treatment in the building of the semantics.  Semantically, xi (wash) yi (one) ge (CLA) tong‑kuai (comfortable) de (DE) zao (bath): ‘take a comfortable bath’ actually means tong‑kuai (comfortable) de2 (DE2) xi (wash) yi (one) ci (time) zao (bath): ‘comfortably take a bath once’.  The syntactic modifier of the N zao is semantically a modifier attached to the whole idiom.  The classifier phrase of the N becomes the semantic ‘action-times’ modifier of the idiom.  The elaboration of semantics in such cases is left for future research.

[14] The two groups classified by L. Li (1990) are not restricted to the V+N combinations.  In order not to complicate the case,  only the comparison of the two groups of V+N idioms are discussed here.  Note also that in the tables, he used the term ‘bound morpheme’ (inappropriately) to refer to the co-occurrence constraint of the idioms.

[15] Another type of X-insertion is that N can occasionally be expanded by adding a de‑phrase modifier.  However, this use is really rare.

[16] Since they are only a small, easily listable set of verbs, and they only demonstrate limited separated uses (instead of full pattern variations of a transitive VP construction), to list these words and all their separated uses in the lexicon seems to be a better way than, say, trying to come up with another lexical rule just for this small set.  Listing such idiosyncratic use of language in the lexicon is common practice in NLP.

[17] In fact, this set has been becoming smaller because some idioms, say zhu-yi ‘focus-attention: pay attention to’, which used to be in this set, have already lost all separated phrasal uses and have become words per se.  Other idioms including dan-xin (worry about) are in the process of transition (called ionization by Chao 1968) with their increasing frequency of being used as words.   There is a fairly obvious tendency that they combine more and more closely as words, and become transparent to syntax.  It is expected that some, or all, of them will ultimately become words proper in future, just as zhu-yi did.

[18] In general, one cannot use kan-jian to translate English future tense ‘will see’, instead one should use the single-morpheme word kan:  I will see him –> wo (I) jiang (will) kan (see) ta (he).

[19] Of course, [v_buyu] is a sub-type of verb [v].

[20] The use of this feature for representing negation was suggested in  Footnote 18 in Pollard and Sag (1994:25)

[21] This is the procedural perspective of viewing the lexical rules.  As pointed out by Pollard and Sag (1987:209), “Lexical rules can be viewed from either a declarative or a procedural perspective: on the former view, they capture generalizations about static relationships between members of two or more word classes; on the latter view, they describe processes which produce the output from the input form.”



PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

PhD Thesis: Chapter I Introduction

PhD Thesis: Chapter II Role of Grammar

PhD Thesis: Chapter III Design of CPSG95

PhD Thesis: Chapter IV Defining the Chinese Word

PhD Thesis: Chapter V Chinese Separable Verbs

PhD Thesis: Chapter VI Morpho-syntactic Interface Involving Derivation

PhD Thesis: Chapter VII Concluding Remarks

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP

PhD Thesis: Chapter IV Defining the Chinese Word


4.0. Introduction

This chapter examines the linguistic definition of the Chinese word and establishes its formal representation in CPSG95.  This lays a foundation for the treatment of Chinese morpho-syntactic interface problems in later chapters.

To address issues on interfacing morphology and syntax in Chinese NLP, the fundamental question is:  what is a Chinese word?  A proper answer to this question defines the boundaries between morphology, the study of how morphemes combine into words, and syntax, the study of how words combine into phrases.  However, there is no easy answer to this question.

In fact, how to define Chinese words has been a central topic among Chinese grammarians for decades (Hu and Wen 1954; L. Wang 1955;  Z. Lu 1957; Lin 1983; Lü 1989; Shi 1992; Dai 1993; Zhao and Zhang 1996).  In late 50’s, there was a heated discussion on the definition of Chinese word in China.  This discussion was induced by the campaign for the Chinese writing system reform (文字改革运动).  At that time, the government policy was to ultimately replace the Chinese characters (hanzi) by a Romanized writing system.  The system of pinyin, based on the Latin alphabet, was designed to represent the pronunciation of the characters in the Contemporary Mandarin.  The simplest way is to use pinyin as a writing system and simply translate Chinese characters into syllables in pinyin.  But it was soon found impractical due to the many-to-one correspondence from hanzi to syllable.  Text in pinyin with no  explicit word boundary delimiters is hardly comprehensible.   Linguists agree that the key issue for the feasibility of a pinyin-based writing system is to establish a standard or definition for Chinese words (Z. Lu 1957).  Once words can be identified by a common standard, the pinyin system can in principle be adopted for recording the Chinese language by using space and punctuation marks to separate words.  This is because the number of homophones at the word level is dramatically reduced when compared to the number of homophones at the hanzi (morpheme or monosyllabic) level.

But the definition of a Chinese word is a very complicated issue due to the existence of a considerable amount of borderline cases.  It has never been possible to reach a precise definition which can be applied to all circumstances and which can be accepted by linguists from different schools.

There have been many papers addressing the Chinese wordhood issue (e.g. Z. Lu 1957; Lin 1983; Lü 1989; Dai 1993).  Although there are still many problems in defining Chinese words for borderline cases and more debate will continue for many years to come, the understanding of Chinese wordhood has been deepened in the general acknowledgement of the following key aspects:  (i) the distinct status of Chinese morphology;  (ii) the distinction of different notions of word;  and (iii) the lack of absolute definition across systems or theories.

Almost all Chinese grammarians agree that unlike Classical Chinese, Contemporary Chinese is not based on single-morpheme words.   In other words, the word and the morpheme are no longer coextensive in Contemporary Chinese.[1]  In fact, that is the reason why we need to define Chinese morphology.  If the word and the morpheme stand for the same linguistic object in a language, like Classical Chinese, the definition of  morpheme will entail the definition of word and there is no role of morphology.

As it stands, there is little debate on the definition of morpheme in Chinese.  It is generally acknowledged that each syllable (or its corresponding written form hanzi) corresponds to (at least) one morpheme.  In a characteristic ‘isolating language’ – Classical Chinese is close to this, there is no or very poor morphology.[2]  However, Contemporary Chinese contains a significant number of bound morphemes in word formation (Dai 1993).  In particular, it is observed that many affixes are highly productive (Lü et al 1980).

It is widely acknowledged that the grammar of Contemporary Chinese is not complete without the component of morphology (Z. Lu 1957; Chao 1968; Li and Thompson 1981; Dai 1993; etc.).   Based on this widely accepted assumption, one major task for this thesis is to argue for the proper place to cut the line between morphology and syntax, and to explore effective ways of interleaving the two for analysis.

A significant development concerning the Chinese wordhood study is the  distinction between two different notions of word:  grammar word versus vocabulary word.  It is now clear that in terms of grammar analysis, a vocabulary word is not an appropriate notion (Lü 1989; more discussion to come in 4.1).

Decades of debate and discussion on the definition of a Chinese word have also shown that an operational definition for a grammar word precise enough to apply to all cases can hardly be established across systems or theories.  But a computational grammar of Chinese cannot be developed without precise definitions.  This leads to an argument in favor of the system internal wordhood definition and the interface coordination within a grammar.

The remaining sections of this chapter are organized like this.  Section 4.1 examines two notions of word.  Making sure that we use the right notion based on some appropriate guideline, some operational methods for judging a Chinese grammar word will be developed in 4.2.  Section 4.3 demonstrates the formal representation of a word in CPSG95.  This formalization is based on the design of expectation feature structures and the structural feature structure  presented in Chapter III.

4.1. Two Notions of Word

This section examines the two notions of word which have caused confusion.  The first notion, namely vocabulary word, is easy to define.  However, for the second notion, namely, grammar word, unfortunately,  no operational definition has been available.  It will be argued that a feasible alternative is to system internally define a grammar word and the labor division between Chinese morphology and syntax.

A grammar word stands for the grammatical unit which fits in the hierarchy of morpheme, word and phrase in linguistic analysis.  This gives the general concept of this notion but it is by no means an operational definition.  Vocabulary word, on the other hand, refers to the listed entry in the lexicon.  This definition is simple and unambiguous once a lexicon is given.  The lexical lookup will generate vocabulary words as potential building blocks for analysis.

On one hand, vocabulary words come from the lexicon;  they are basic building blocks for linguistic analysis.  On the other hand, as the ‘resulting’ unit for morphological analysis as well as the ‘starting’ or ‘atomic’ unit for syntactic analysis, the grammar word is the notion for linguistic generalization.  But it is observed that a vocabulary word is not necessarily a grammar word and vice versa.  It is this possible mismatch between vocabulary word and grammar word that has caused a problem in both Chinese grammar research and Chinese NLP system development.

Lü (1989) indicates that not making a distinction between these two notions of word has caused considerable confusion on the definition of Chinese word in the literature.  He further points out that only the former notion should be used in the grammar research.

Di Sciullo and Williams (1987) have similar ideas on these two notions of word.  They indicate that a sign listable in the lexicon corresponds to no certain grammatical unit.[3]   It can be a morpheme, a (grammar) word, or a phrase including sentence.  Some examples of different kinds of Chinese vocabulary words are given below to demonstrate this insight.

(4-1.) sample Chinese vocabulary words

(a) 性           bound morpheme, noun suffix, ‘-ness’
(b) 洗           free morpheme or word, V: ‘wash’
(c) 澡           word (only used in idioms), N: ‘bath’
(d) 澡盆        compound word, N: ‘bath-tub’
(e) 洗澡        idiom phrase, VP: ‘take a bath’
(f) 他们         pronoun as noun phrase, NP: ‘they’
(g) 城门失火,殃及池鱼

idiomatic sentence, S:
‘When the gate of a city is on fire, the fish in the
canal around the gate is also endangered.’

The above signs are all Chinese vocabulary words.  But grammatically, they do not necessarily function as a grammar word.  For example, (4-1a) functions as a suffix, smaller than a word.  (4-1e) behaves like a transitive VP (see 5.1 for more evidence), and (4-1g) acts as a sentence, both larger than a word.  The consequence of mixing up these different units in a grammar is the loss of power for a grammar to capture the linguistic generality for each level of grammatical unit.

The definition of grammar word has been a contentious issue in general linguistics (Di Sciullo and Williams 1987).  Its precise definition is particularly difficult in Chinese linguistics as there is a considerable amount of phenomena marginal between Chinese morphology and syntax (Zhu 1985; L. Li 1990; Sun and Huang 1996).  The morpheme-word-phrase transition is a continuous band in the linguistic reality.  Different grammars may well cut the division differently.  As long as there is no contradiction in coordinating these objects within the grammar, there does not seem to exist absolute judgment on which definition is right and which is wrong.

It is generally agreed that a grammar word is the smallest unit in syntax (Lü 1989), as also emphasized by Di Sciullo and Williams (1987) on the ‘syntactic atomicity’ of word.[4]  But this statement only serves as a guideline in theory, it is not an operational definition for the following reason.  It is logically circular to define word, smallest unit in syntax, and syntax, study of how words combine into phrases, one upon the other.

To avoid this ‘circular definition’ problem, a feasible alternative is to system internally define grammar word and the labor division between Chinese morphology and syntax, as in the case of CPSG95.  Of course, the system internal definition still needs to be justified based on the proposed morphological or syntactic analysis of borderline phenomena in terms of capturing the linguistic generality.  More specifically, three things need to be done:  (i) argue for the analysis case by case, e.g. why a certain construction should be treated as a morphological or syntactic phenomenon, what linguistic generality is captured by such a treatment, etc.;  (ii) establish some operational methods for wordhood judgment to cover similar cases;  (iii) use formalized data structures to represent the linguistic units after the wordhood judgment is made.  Section 4.2 will handle task (ii) and Section 4.3 is devoted to the formal definition of word required by task (iii).   The task in (i) will be pursued in the remaining chapters.

Another important notion related to grammar word is unlisted word.  Conceptually, an unlisted word is a novel construction formed via morphological rules, e.g. a derived word like 可读性 ke-du-xing (-able-read-ness: readability), foolish-ness, a compound person name (given name + family name) such as John Smith, 毛泽东 mao-ze-dong (Mao Zedong).  Unlisted words are often rule-based.  This is where productive word formation sets in.

However, unlisted word is not a crystal clear notion, just like the underlying concept grammar word.  Many grammarians have observed that phrases and unlisted words in Chinese are formed under similar rules (e.g. Zhu 1985; J. Lu 1988).  As both syntactic constructions and unlisted words are rule based, it can be difficult to judge a significant amount of borderline constructions as morphological or syntactic.

There are fuzzy cases where a construction is regarded as a grammar word by one and judged as a syntactic construction by another.  For example, while san (three) ge (CLA) is regarded as a syntactic construction, namely numeral-classifier phrase, in many grammars including CPSG95, such constructions are treated as compound words by others (e.g. Chen and Liu 1992).  ‘Quasi-affixation’ presents another outstanding ‘gray area’ (see 6.2).

The difficulty in handling the borderline phenomena leads back to the argument that the labor division between Chinese morphology and syntax should be pursued system internally and argued case by case in terms of capturing the linguistic generality.  To implement the required system internal definition, it is desirable to investigate practical wordhood judgment methods in addition to case-by-case arguments.  Some judgment methods will be developed in 4.2.  Case-by-case arguments and analysis for specific phenomena will be presented in later chapters.  After the wordhood judgment is made, there is a need for the formal representation.  Section 4.3 defines the formal representation of word with illustrations.

4.2. Judgment Methods

This section proposes some operational wordhood judgment methods based on the notion of ‘syntactic atomicity’ (Di Sciullo and Williams 1987).  These methods should be applied in combination with arguments of the associated grammatical analysis.  In fact, whether a sign is judged as a morpheme, a grammar word or a phrase ultimately depends on the related grammatical analysis.  However, the operationality of these methods will help facilitate the later analysis for some individual problems and avoid unnecessary repetition of similar arguments.

Most methods proposed for Chinese wordhood judgment in the literature are not fully operational.  For example, Chao (1968) agrees with Z. Lu (1957) that a word can fill the functional frame of a typical syntactic structure.  Dai (1993) points out that this method may effectively separate bound morphemes from free words, it cannot differentiate between words and phrases, as phrases may also be positioned in a syntactic frame.  In fact, whether this method can indeed separate bound morphemes from free words is still a problem.  This method cannot be made operational unless the definition of ‘frame of a typical syntactic structure’ is given.  The judgment methods proposed in this section try to avoid this ‘lack of operationality’ problem.

Dai (1993) made a serious effort in proposing a series of methods for cutting the line between morphemes and syntactic units in Chinese.  These methods have significantly advanced the study of this topic.  However, Dai admits that there is limitation associated with these proposals.  While each proposed method provides a sufficient (but not necessary) condition for judging whether a unit is a morpheme,  none of the methods can further determine whether this unit is a word or a phrase.  For example, the method of syntactic independence tests whether a unit in a question can be used as a short answer to the question.  If yes, the syntactic independence is confirmed and this unit is not a morpheme inside a word.  Obviously, such a method tells nothing about the syntactic rank of the tested unit because a word, a phrase or clause can all serve as an answer to a question.  In order to achieve that, other methods and/or analyses need to be brought in.

The first judgment method proposed below involves passivization and topicalization tests.  In essence, this is to see whether a string involves syntactic processes.  As an atomic unit, the internal structure of a word is transparent to syntax.  It follows that no syntactic processes are allowed to exert effects on the internal structure of a word.[5]  As  passivization and topicalization are generally acknowledged to be typical syntactic processes, if a potential combination A+B is subject to passivization B+bei+A and topicalization B+…+NP+A, it can be concluded that A+B is not a word:   the relation between A and B must be syntactic.

The second method is to define an unambiguous pattern for the wordhood judgment, namely, judgment patterns.  Judgment patterns are by no means a new concept.  In particular, keyword based judgment patterns have been frequently used in the literature of Chinese linguistics as a handy way for deterministic word category detection (e.g. L. Wang 1955;  Zhu 1985; Lü 1989).

The following keyword (i.e. aspect markers) based patterns are proposed for  judging a verb sign.

(a) V(X)+着/过 –> word(X)
(b) V(X)+着/过/了+NP –> word(X)

The pattern (4-2a) states that if X is a sign of verb, no matter transitive or intransitive, appearing immediately before zhe/guo, then X is a word.  This proposal is backed by the following argument.  It is an important and widely acknowledged grammatical generalization in Chinese syntax that the aspect markers appear immediately after lexical verbs (Lü et al 1980).

Note that the aspect marker le (LE) is excluded from the pattern in (4-2a) because the same keyword le corresponds to two distinctive morphemes in Chinese:  the aspect le (LE) attaches to a lexical V while the sentence-final le (LEs) attaches to a VP (Lü et al 1980).  Therefore, judgment cannot be reliably made when a sentence ends in X+le, for example, when X is an intransitive verb or a transitive verb with the optional object omitted.  However, le in pattern (4-2b) has no problem since le is not in the ambiguous sentence final position.  This pattern says that if any of the three aspect markers appears between a sign X of verb and NP, X must be a word:  in fact, it is a lexical transitive verb.

There are two ways to use the judgment patterns.  If a sub-string of the input sentence matches a judgment pattern, one reaches the conclusion promptly.  If the input string does not match a pattern directly, one can still make indirect use of the patterns for judgment.  The idiomatic combination xi (wash) zao (bath) is a representative example.   Assume that the vocabulary word xi zao is a grammar word.  It follows that it should be able to fill in the lexical verb position in the judgment pattern (4-2a).  We then make a sentence which contains a substring matching the pattern to see whether it is grammatical.  The result is ungrammatical:  * 他洗澡着 ta (he) xi-zao (V) zhe (ZHE);  * 他洗澡过 ta (he) xi-zao (V) guo (GUO).  Therefore, our assumption must be wrong:  洗澡 xi zao is not a grammar word.  We then change the assumption and try to insert aspect markers inside them (it is in fact an expansion test, to be discussed shortly).  The new assumption is that the verb xi alone is a grammar word.  What we get are perfectly grammatical sentences and they match the pattern (4-2b):  他洗着澡 ta (he) xi (V) zhe (ZHE) zao (bath): ‘He is taking a bath’;  他洗过澡 ta (he) xi (V) guo (GUO) zao (bath): ‘He has taken the bath’.  Therefore the assumption is proven to be correct.  This way, all V+X combinations can be judged based on the judgment patterns (4-2a) or (4-2b).

The third method proposed below involves a more general expansion test.  As an atomic unit in syntax, the internal parts of a word are in principle not separable.[6]  Lü (1989) emphasized inseparability as a criterion for judging grammar words.  But he did not give instructions how this criterion should be applied.  Nevertheless, many linguists (e.g. Bloomfield 1933; Z. Lu 1957;  Lyons 1968; Dai 1993) have discussed expansion tests one way or another in assisting the wordhood judgment.

The method of expansion to be presented below for wordhood judgment is called X-insertion.  X-insertion is based on Di Sciullo and Williams’ thesis of the syntactic atomicity of word.  The rationale is that the internal parts of a word cannot be separated by syntactic constituents.

As a method, how to perform X-insertion is defined as follows.   Suppose that one needs to judge whether the combination A+B is a word.   If a sign X can be found to satisfy the following condition, then A+B is not a word, but a syntactic combination:  (i) A+X+B is a grammatical string,  (ii) X is not a bound morpheme, and (iii) the sub-structure [A+X] is headed by A or the sub-string [X+B] is headed by B.

The first constraint is self-evident:  a syntactic combination is necessarily a grammatical string.  The second constraint aims at  eliminating the danger of wrongly applying an infix here.  In fact, if X is a morphological infix, the conclusion would be just opposite:  A+B is a word.  The last constraint states that X must be a dependant of the head A (or B).  Otherwise, it results in a different structure.  There is no direct structural relation between A and B when A (or B) is a dependant of the head X in the structure.  Therefore, the question of whether A+B is a phrase or a word does not apply in the first place.

After the wordhood judgment is made on strings of signs based on the above judgment methods and/or the arguments for the analysis involved, the next step is to have them properly represented (coded) in the grammar formalism used.  This is the topic to be presented in 4.3 below.

4.3. Formal Representation of Word

The expectation feature structure and structural phrase structure in the mono-stratal design of CPSG95 presented in Chapter III provide means for the formal definition of the basic unit word in CPSG95.  Once the wordhood judgment for a unit is made based on arguments for a structural analysis and/or using the methods presented in Section 4.2., the formal representation is required for coding it in CPSG95.

This type of formalization is required to ensure its implementability in enforcing a required configurational constraint.  For example, the suffix 性 -xing expects an adjective word to form an abstract noun, such constraints [CATEGORY a] and @word can be placed in the morphological expectation feature [SUFFIXING].  These constraints will permit, for example, the legitimately derived word 严肃性 [yan-su]-xing] (serious-ness), but will block the following combination * 非常严肃性 [[fei-chang yan-su]-xing] (very-serious-ness).  This is because 非常严肃 [fei-chang yan-su] violates the formal constraint as given in the word definition:  it is not an atomic unit in syntax.

In CPSG95, word is defined as a syntactically atomic unit without obligatory morphological expectations, formally represented in the following macro.

word macro
PREFIXING saturated | optional
SUFFIXING saturated | optional
STRUCT no_syn_dtr

Note that the above formal definition uses the sorted hierarchy [struct] for the structural feature structure and the sorted hierarchy [expected] for the expectation feature structure.  The definitions of these feature structures have been given in the preceding Chapter III.

Based on the sorted hierarchy struct: {syn_dtr, no_syn_dtr}, the constraint [no_syn_dtr] ensures that the word sign do not contain any syntactic daughter.[7]  This prevents syntactic constructions from being treated as words.  On the other hand, since [saturated], [obligatory] and [optional] are three subtypes of [expected], the constraint [saturated|optional] prevents a bound morpheme, say a prefix or suffix which has obligatory expectation in [PREFIXING] or [SUFFIXING], from being treated as a word.

This macro definition covers the representation of mono-morpheme words, e.g. 鹅 e ‘goose’, 读 du ‘read’, etc., or multi-morpheme words, e.g. 小看 xiao-kan ‘look down upon’, 天鹅 tian-e ‘swan’, etc., as well as unlisted words such as derived words whose internal morphological structures have already been formed.  Some typical examples of word are shown below.



For a derived word, note that the specification of [PREFIXING satisfied] and [STRUCT prefix], or [SUFFIXING satisfied] and [STRUCT suffix], assigned by the corresponding PS rule is compatible with the macro word definition.

The above word definition is an extension of the corresponding representation features from HPSG (Pollard and Sag 1987).  HPSG uses a binary structural feature [LEX] to distinguish lexical signs, [LEX +], and non-lexical signs, [LEX -].  In addition, [sign] is divided into [lexical_sign] and [phrasal_sign].[8]  Except for the one-to-one correspondence between [phrasal_sign] and [syn_dtr] in terms of rank (which stands for non-atomic syntactic constructs including phrases), neither of these HPSG binary divisions account for the distinction between a bound morpheme and a free morpheme.  Such a distinction is not necessary in HPSG because bound morphemes are assumed to be processed in the preprocessing stage (e.g. lexical rules for English inflection, Pollard and Sag 1987) and do not show themselves as independent input to the parser.  As CPSG95 involves both derivation morphology and syntax in an integrated general grammar, the HPSG binary divisions are no longer sufficient for formalizing the word definition.  ‘Word’ in CPSG95 needs to be distinguished with proper constraints from not only syntactic constructs, but also from affixes (bound morphemes).

In CPSG95, as productive derivation is designed to be an integrated component of the grammar, the word definition is both specified in the lexicon for some free morpheme words and assigned by the rules in morphological analysis.  This practice in essence follows one  suggestion in the original HPSG book:  “we might divide rules of grammar into two classes: rules of word formation, including compounding rules, which introduce the specification [LEX +] on the mother, and other rules, which introduce [LEX -] on the mother.” (Pollard and Sag 1987:73).

It is worth noticing that words thus defined can fill either a morphological position or a syntactic position.  This reflects the interface nature of word:  word is an eligible unit in both morphology and syntax.  This is in contrast to bound morphemes which can only be internal parts of morphology.

In morphology, derivation combines a word and an affix into a derived word.  These derivatives are eligible to feed morphology again.   This is shown above by the examples in (4-5) and (4-6).  The adjective word 可读 ke-du (read-able) is derived from the prefix morpheme 可 ke- (-able) and the word 读 du (read).  Like other adjective words, this derived word can further combine with the suffix 性
–xing (-ness) in morphology.  It can also directly enter syntax, as all words do.

To syntax, all words are atomic units.  If a lexical position is specified, via the macro constraint @word in CPSG95, in a syntactic pattern, it makes no difference whether a filler of this position is a listed grammar word, or an unlisted word such as a derivative.  Such distinction is transparent to the syntactic structure.

4.4. Summary

Efforts have been made to reach a better understanding of Chinese wordhood in theory, methodology and formalization.  The main spirit of the HPSG theory and Di Sciullo and Williams’ ‘syntactic atomicity’ theory has been applied to the study of Chinese wordhood and its formal representation.  Some effective wordhood judgment methods have also been proposed, based on theoretical guidelines.

The above work in the area of Chinese wordhood study provides a sound foundation for the analysis of the specific Chinese morpho-syntactic interface problems in Chapter V and Chapter VI.




[1] For Classical Chinese, word, morpheme, syllable and hanzi are presumably all co-extensive.  This is the so-called Monosyllabic Myth of Chinese (DeFrancis 1984: ch.8).  The development of large numbers of homophones, mainly due to the loss of coda stops, has led to the development of large quantities of bi-syllabic and poly-syllabic word-like expressions (Chen and Wang 1975).

[2] Classical Chinese arguably allows for a certain degree of compounding.  In the linguistic literature, some linguists (e.g. Sapir 1921; Zhang 1957; Jensen 1990) did not strictly distinguish Contemporay/Modern Chinese from Classical Chinese and they held the general view that Chinese has little morphology except for limited compounding.  But this view of Contemporary Chinese has been criticized as misconception (Dai 1993) and is no longer accepted by the community of Chinese grammarians.

[3] Di Sciullo and Williams call a sign listable in the lexicon listeme, equivalent to the notion vocabulary word.

[4] In the literature, variations of  this view include the Lexicalist position (Chomsky 1970), the Lexical Integrity Hypothesis (Jackendoff 1972), the Principle of Morphology-Free Syntax (Zwicky 1987), etc.

[5] This type of ‘atomicity’ constraint (Di Sciullo and Williams 1987) is generally known as Lexical Integrity Hypothesis (LIH, Jackendoff 1972), which states that syntactic rules or operations cannot refer to part of a word.  A more elaborate version of LIH is proposed by Zwicky (1987) as a Principle of Morphology-Free Syntax.  This principle states that syntactic rules cannot make reference to the internal morphological composition of words.  The only lexical properties accessible to syntax, according to Zwicky, are syntactic category, subcategory, and features like gender, case, person, etc.

[6] Of course, in theory a word may be separated by morphological infix.  But except for the two modal signs de3 (can) and bu (cannot) (see Section 5.3 in Chapter V), there does not seem to exist infixation in Mandarin Chinese.

[7] In terms of rank, [no_syn_dtr] in CPSG95 corresponds to the type [lexical_sign] in HPSG (Pollard and Sag 1987).  A binary division between [lexical_sign] and [phrasal_sign] is enough in HPSG to distinguish the atomic unit word from syntactic construction.  But, as CPSG95 incorporates derivation in the general grammar, [no_syn_dtr] covers for both free morphemes and bound morphemes.  That is why the [no_syn_dtr] constraint on [STRUCT] alone cannot define word in CPSG95;  it needs to involve constraints on morphological expectation structures as well, as shown in the macro definition.

[8] Note that there are [LEX -] signs which are not of the type [phrasal_sign].



PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

PhD Thesis: Chapter I Introduction

PhD Thesis: Chapter II Role of Grammar

PhD Thesis: Chapter III Design of CPSG95

PhD Thesis: Chapter IV Defining the Chinese Word

PhD Thesis: Chapter V Chinese Separable Verbs

PhD Thesis: Chapter VI Morpho-syntactic Interface Involving Derivation

PhD Thesis: Chapter VII Concluding Remarks

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP

PhD Thesis: Chapter III Design of CPSG95

3.0. Introduction

CPSG95 is the grammar designed to formalize the morpho-syntactic analysis presented in this dissertation.  This chapter presents the general design of CPSG95 with emphasis on three essential aspects related to the morpho-syntactic interface:  (i) the overall mono-stratal design of the sign;  (ii) the design of expectation feature structures;  (iii) the design of structural feature structures.

The HPSG-style mono-stratal design of the sign in CPSG95 provides a general framework for the information flow between different components of a grammar via unification.  Morphology, syntax and semantics are all accommodated in distinct features of a sign.  An example will be shown to illustrate the information flow between these components.

Expectation feature structures are designed to accommodate lexical information for the structural combination.  Expectation feature structures are vital to a lexicalized grammar like CPSG95.  The formal definition for the sort hierarchy [expected] for the expectation features will be given.  It will be demonstrated that the defined sort hierarchy provides means for imposing a proper structural hierarchy as defined by the general grammar.

One characteristic of the CPSG95 structural expectation is the unique design of morphological expectation features to incorporate Chinese productive derivation.  This design is believed to be a feasible and natural way of modeling Chinese derivation, as shall be presented shortly below and elaborated in section 3.2.1.  How this design benefits the interface coordination between derivation and syntax will be further demonstrated in Chapter VI.

The type [expected] for the expectation features is similar to the HPSG definition of [subcat] and [mod].  They both accommodate lexical expectation information to drive the analysis conducted via the general grammar.  In order to meet some requirements induced by introducing morphology into the general grammar and by accommodating linguistic characteristics of Chinese, three major modifications from the standard HPSG are proposed in CPSG95.  They are:  (i) the CPSG95 type [expected] is more generalized as to cover productive derivation in addition to syntactic subcategorization and modification;  (ii) unlike HPSG which tries to capture word order phenomena as independent constraints, Chinese word order in CPSG95 is integrated in the definition of the expectation features and the corresponding morphological/syntactic relations;  (iii) in terms of handling the syntactic subcategorization, CPSG95 pursues a non-list alternative to the standard practice of HPSG relying on the list design of obliqueness hierarchy.  The rationale and arguments for these modifications are presented in the corresponding sections, with a brief summary given below.

The first modification is necessitated by meeting the needs of introducing Chinese productive derivation into the grammar.  It is observed that a Chinese affix acts as the head daughter of the derivative in terms of expectation (Dai 1993).  The expectation information that drives the analysis of a Chinese productive derivation is found to be capturable lexically by the affix sign;  this is very similar to how the information for the head-driven syntactic analysis is captured in HPSG.  The expansion of the expectation notion to include productive morphology can account for a wider range of linguistic phenomena.  The feasibility of this modification has been verified by the implementation of CPSG95 based on the generalized expectation feature structures.

One outstanding characteristic of all the expectation features designed in CPSG95 is that the word order information is implied in the definition of these features.[1]  Word order constraints in CPSG95 are captured by individual PS rules for the structural relationship between the constituents.  In other words, Chinese word order constraints are not treated as phenomena which have sufficient generalizations of themselves independent of the individual morphological or syntactic relations.  This is very different from the word order treatment in theories like HPSG (Pollard and Sag 1987) and GPSG (Gazdar, Klein, Pullum and Sag 1985).  However, a similar treatment can be found in the work from  the school of ‘categorial grammar’ (e.g. Dowty 1982).

The word order theory in HPSG and GPSG is based on the assumption that structural relations and syntactic roles can be defined without involving the factor of word order.  In other words, it is assumed that the structural nature of a constituent (subject, object, etc.) and its linear position in the related structures can be studied separately.  This assumption is found to be inappropriate in capturing Chinese structural relations.  So far, no one has been able to propose an operational definition for Chinese structural relations and morphological/syntactic roles without bringing in word order.[2]

As Ding (1953) points out, without the means of inflections and case markers, word order is a primary constraint for defining and distinguishing Chinese structural relations.[3]  In terms of expectation, it can always be lexically decided where for the head sign to look for its expected daughter(s).  It is thus natural to design the expectation features directly on their expected word order.

The reason for the non-list design in capturing Chinese subcategorization can be summarized as follows:  (i) there has been no successful attempt by anyone, including the initial effort involved in the CPSG95 experiment, which demonstrates that the obliqueness design can be applied to Chinese grammar with sufficient linguistic generalizations;  (ii) it is found that the atomic approach with separate features for each complement is a feasible and flexible proposal in representing the relevant linguistic phenomena.

Finally, the design of the structural feature [STRUCT]  originates from [LEX + | -] in HPSG (Pollard and Sag 1987).  Unlike the binary type for [LEX], the type [struct] for [STRUCT] forms an elaborate sort hierarchy.  This is designed to meet the configurational requirements of introducing morphology into CPSG95.  This feature structure, together with the design of expectation feature structures, will help create a favorable framework for handling Chinese morpho-syntactic interface.  The proposed structural feature structure and the expectation feature structures contribute to the formal definition of linguistic units in CPSG95.  Such definitions enable proper lexical configurational constraints to be imposed on the expected signs when required.

3.1. Mono-stratal Design of Sign

This section presents the data structure involving the interface between morphology, syntax and semantics in CPSG95.  This is done by defining the mono-stratal design of the fundamental notion sign and by illustrating how different components, represented by the distinct features for the sign, interact.

As a dynamic unit of grammatical analysis, a sign can be a morpheme, a word, a phrase or a sentence.  It is the most fundamental object of HPSG-style grammars.  Formally, a sign is defined in CPSG95 by the type [a_sign], as shown below.[4]

(3-1.) Definition: a_sign

HANZI                            hanzi_list
CONTENT                      content
CATEGORY                    category
SUBJ                               expected
COMP0_LEFT               expected
COMP1_RIGHT             expected
COMP2_RIGHT             expected
MOD_LEFT                    expected
MOD_RIGHT                  expected
PREFIXING                    expected
SUFFIXING                    expected
STRUCT                          struct

The type [a_sign] introduces a set of linguistic features for the description of a sign.  These are features for orthography, morphology, syntax and semantics, etc.[5]  The types, which are eligible to be the values of these features, have their own definitions in the sort hierarchy.  An introduction of these features follows.

The orthographic feature [HANZI] contains a list of Chinese characters (hanzi or kanji).  The feature [CONTENT] embodies the semantic representation of the sign.  [CATEGORY] carries values like [n] for noun, [v] for verb, [a] for adjective, [p] for preposition, etc.  The structural feature [STRUCT] contains information on the relation of the structure to its sub-constituents, to be presented in detail in section 3.3.

The features whose appropriate value must be the type [expected] are called expectation features.  They are the essential part of a lexicalist grammar as these features contain information about various types of potential structures in both syntax and morphology.  They specify various constraints on the expected daughter(s) of a sign for structural analysis.   The design of these expectation features and their appropriate type [expected] will be presented shortly in section 3.2.

The definition of [a_sign] illustrates the HPSG philosophy of mono-stratal analysis interleaving different components.  As seen, different components of Chinese grammar are contained in different feature structures for the general linguistic unit sign.  Their interaction is effected via the unification of relevant feature structures during various stages of analysis.  This will unfold as the solutions to the morpho-syntactic interface problems are presented in Chapter V and Chapter VI.  For illustration, the prefix 可 ke (-able) is used as an example in the following discussion.

As is known, the prefix ke– (-able) makes an adjective out of a transitive verb:  ke- + Vt –> A.  This lexicalized rule is contained in the CPSG95 entry for the prefix ke-, shown in (3-2).  Following the ALE notation, @ is used for macro, a shorthand mechanism for a pre-defined feature structure.[6]


As seen, the prefix ke- morphologically expects a sign with [CATEGORY vt].  An affix is analyzed as the head of a derivational structure in CPSG95 (see section 6.1 for discussion) and [CATEGORY] is a representative head feature to be percolated up to the mother sign via the corresponding morphological PS rule as formulated in (6-4) of section 6.2, this expectation eventually leads to a derived word with [CATEGORY a].  Like most Chinese adjectives, the derived adjective has an optional expectation for a subject NP to account for sentences like 这本书很可读 zhe (this) ben (CLA) shu (book) hen (very) ke-du (read-able): ‘This book is very readable’.  This syntactic optional expectation for the derivative is accommodated in the head feature [SUBJ].

Note that before any structural combination of ke- with other expected signs, ke- is a bound morpheme, a sign which has obligatory morphological expectation in [PREFIXING].  As a head for both the morphological combination ke+Vt and the potential syntactic combination NP+[ke+Vt], the interface between morphology and syntax in this case lies in the hierarchical structures which should be imposed.   That is, the morphological structure (derivation) should be established before its syntactic expected structure can be realized.  Such a configurational constraint is specified in the corresponding PS rules, i.e. the Subject PS Rule and The Prefix PS Rule.  It guarantees that the obligatory morphological expectation of ke- has to be saturated before the sign can be legitimately used in syntactic combination.

The interaction between morphology/syntax and semantics in this case is encoded by the information flow, i.e. structure-sharing indicated by the number index in square brackets, between the corresponding feature structures inside this sign.  The semantic compositionality involved in the morphological and syntactic grouping is represented like this.  There is a semantic predicate marked as [-able] (for worthiness) in the content feature [RELN];  this predicate has an argument which is co-indexed by [1] with the semantics of the expected Vt.  Note that the syntactic subject of the derived adjective, say ke-du (read-able) or ke-chi (eat-able), is the semantic (or logical) object of the stem verb, co-indexed by [2] in the sample entry above.  The head feature [CONTENT] which reflects the semantic compositionality will be percolated up to the mother sign when applicable morphological and syntactic PS rules take effect in structure building.

In summary, embodied in CPSG95 is a mono-stratal grammar of morphology and syntax within the same formalism.  Both morphology and syntax use same data structure (typed feature structure) and mechanisms (unification, sort hierarchy, PS rules, lexical rules, macros).   This design for Chinese grammar is original and is shown to be feasible in the CPSG95 experiments on various Chinese constructions.  The advantages of handling morpho-syntactic interface problems under this design will be demonstrated throughout this dissertation.

3.2. Expectation Feature Structures

This section presents the design of the expectation features in CPSG95.  In general, the expectation features contain information about various types of potential structures of the sign.  In CPSG95, various constraints on the expected daughter(s) of a sign are specified in the lexicon to drive both morphological and syntactic structural analysis.  This provides a favorable basis for interleaving Chinese morphology and syntax in analysis.

The expected daughter in CPSG95 is defined as one of the following grammatical constituents:  (i) subject in the feature [SUBJ];  (ii) first complement in the feature [COMP0_LEFT] or [COMP1_RIGHT];  (iii) second complement in [COMP2_RIGHT];   (iv) head of a modifier in the feature [MOD_LEFT] or [MOD_RIGHT];   (v) stem of an affix in the feature [PREFIXING] or [SUFFIXING].[7]  The first four are syntactic daughters which will be investigated in sections 3.2.2 and 3.2.3.  The last one is the morphological daughter for affixation, to be presented in section 3.2.1.  All these features are defined on the basis of the relative word order of the constituents in the structure.  The hierarchy for the structure at issue resorts to the configurational constraints which will be presented in section 3.2.4.

3.2.1. Morphological Expectation

One key characteristic of the CPSG95 expectation features is the design of morphological expectation features to incorporate Chinese productive derivation.

It is observed that a Chinese affix acts as the head daughter of the derivative in terms of expectation (see section 6.1 for more discussion).   An affix can lexically define what stem to expect and can predict the derivation structure to be built.  For example, the suffix 性 –xing demands that it combine with a preceding adjective to make an abstract noun, i.e. A+-xing –> N.  This type of information can be easily captured by the expectation feature structure in the lexicon, following the practice of the HPSG treatment of the syntactic expectation such as subcategorization and modification.

In the CPSG95 lexicon, each affix entry is encoded to provide the following derivation information:   (i) what type of stem it expects;  (ii) whether it is a prefix or suffix to decide where to look for the expected stem;  (iii) what type of (derived) word it produces.  Based on this lexical information, the general grammar only needs to include two PS rules for Chinese derivation:  one for prefixation, one for suffixation.  These rules will be formulated in Chapter VI (sections 6.2 and 6.3).  It will also be demonstrated that this lexicalist design for Chinese derivation works for both typical cases of affixation and for some difficult cases such as ‘quasi-affixation’ and zhe-suffixation.

In summary, the morphological combination for productive derivation in CPSG95 is designed to be handled by only two PS rules in the general grammar, based on the lexical specification in [PREFIXING] and [SUFFIXING].  Essentially, in CPSG95, productive derivation is treated like a ‘mini-syntax’;[8]  it becomes an integrated part of Chinese structural analysis.

3.2.2. Syntactic Expectation

This section presents the design of the expectation features to represent Chinese syntactic relations.  It will be demonstrated that constraints like word order and function words are crucial to the formalization of syntactic relations.  Based on them, four types of syntactic relations can be defined, which are accommodated in six syntactic expectation feature structures for each head word.

There is no general agreement on how to define Chinese syntactic relations.  In particular, the distinction between Chinese subject and object has been a long debated topic (e.g. Ding 1953; L. Li 1986, 1990; Zhu 1985; Lü 1989).  The major difficulty lies in the fact that Chinese does not have inflection to indicate subject-verb agreement and nominative case or accusative case, etc.

Theory-internally, there have been various proposals that Chinese syntactic relations be defined on the basis of one or more of the following factors:  (i) word order (more precisely, constituent order);  (ii) the function words associated with the constituents;  (iii) the semantic relations or roles.  The first two factors are linguistic forms while the third factor belongs to linguistic content.

L. Li (1986, 1990) relies mainly on the third factor to study Chinese verb patterns. The constituents in his proposal are named as NP-agent (ming-shi), NP-patient (ming-shou), etc. This practice amounts to placing an equal sign between the syntactic relation and semantic relation.  It implies that the syntactic relation is not an independent feature.  This makes syntactic generalization difficult.

Other Chinese grammarians (e.g. Ding 1953; Zhu 1985) emphasize the factor of word order in defining syntactic relations.  This school insists that syntactic relations be differentiated from semantic relations.  More precisely, semantic relations should be the result of the analysis of syntactic relations.  That is also the rationale behind the CPSG95 practice of using word order and other constraints (including function words) in the definition of Chinese relations.

In CPSG95, the expected syntactic daughter in CPSG95 is defined as one of the following grammatical constituents:  (i) subject in the feature [SUBJ], typically an NP which is on the leftmost position relative to the head;  (ii) complements closer to the head in the feature [COMP0_LEFT] or [COMP1_RIGHT], in the form of an NP or a specific PP;  (iii) the second complement in [COMP2_RIGHT]: this complement is defined to be an XP (NP, a specific PP, VP, AP, etc.) farther away from the head than [COMP1_RIGHT] in word order;  (iv) head of a modifier in the feature [MOD_LEFT] or [MOD_RIGHT].  In this defined framework of four types of possible syntactic relations, for each head word, the lexicon is expected to specify the specific constraints in its corresponding expectation feature structures and map the syntactic constituents to the corresponding semantic roles in [CONTENT].  This is a secure way of linking syntactic structures and their semantic composition for the following reason.  Given a specific head word and a syntactic structure with its various constraints specified in the expectation feature structures, the decoding of semantics is guaranteed.[9]

A Chinese syntactic pattern can usually be defined by constraints from category, word order, and/or function words (W. Li 1996).  For example, NP+V, NP+V+NP, NP+PP(x)+NP, NP+V+NP+NP, NP+V+NP+VP, etc.  are all  such patterns.  With the design of the expectation features presented above, these patterns can be easily formulated in the lexicon under the relevant head entry, as demonstrated by the sample formulations given in (3-3) and (3-4).



The structure in (3-3) is a Chinese transitive pattern in its default word order, namely NP1+Vt+NP2.  The representation in (3-4) is another transitive pattern NP+PP(x)+Vt.  This pattern requires a particular preposition x to introduce its object before the head verb.

The sample entry in (3-5) is an example of how modification is represented in CPSG95.  Following the HPSG semantics principle, the semantic content from the modifier will be percolated up to the mother sign from the head-modifier structure via the corresponding PS rule.  The added semantic contribution of the adverb chang-chang (often) is its specification of the feature [FREQUENCY] for the event at issue.


3.2.3. Chinese Subcategorization

This section presents the rationale behind the CPSG95 design for subcategorization.  Instead of a SUBCAT-list, a keyword approach with separate features for each complement is chosen for representing the subcategorization information, as shown in the corresponding expectation features in section 3.2.2.  This design has been found to be a feasible alternative to the standard practice of HPSG relying on the list design of obliqueness hierarchy and SUBCAT Principle when handling subject and complements.

The CPSG95 design for representing subcategorization follows one proposal from Pollard and Sag (1987:121), who point out:  “It may be possible to develop a hybrid theory that uses the keyword approach to subjects, objects and other complements, but which uses other means to impose a hierarchical structure on syntactic elements, including optional modifiers not subcategorized for in the same sense.”  There are two issues for such a hybrid theory:  the keyword approach to representing subject and complements and the means for imposing a hierarchical structure.  The former is discussed below while the latter will be addressed in the subsequent section 3.2.4.

The basic reason for abandoning the list design is due to the lack of an operational definition of obliqueness which captures generalizations of Chinese subcategorization.  In the English version of HPSG (Pollard and Sag 1987, 1994), the obliqueness ordering is established between the syntactic notions of subject, direct object and second object (or oblique object).[10]  But these syntactic relations themselves are by no means universal.  In order to apply this concept to the Chinese language, there is a need for an operational definition of obliqueness which can be applied to Chinese syntactic relations.  Such a definition has not been available.

In fact, how to define Chinese subject, object and other complements has been one of the central debated topics among Chinese grammarians for decades (Lü 1946, 1989; Ding 1953; L. Li 1986, 1990; Zhu 1985; P. Chen 1994).  No general agreement for an operational, cross-theory definition of Chinese subcategorization has been reached.  It is often the case that formal or informal definitions of Chinese subcategorization are given within a theory or grammar.   But so far no Chinese syntactic relations defined in a theory are found to demonstrate convincing advantages of a possible obliqueness ordering, i.e. capturing the various syntactic generalizations for Chinese.

Technically, however, as long as subject and complements are formally defined in a theory, one can impose an ordering of them in a SUBCAT list.  But if such a list does not capture significant generalizations, there is no point in doing so.[11]  It has turned out that the keyword approach is a promising alternative once proper means are developed for the required configurational constraint on structure building.

The keyword approach is realized in CPSG95 as follows.  Syntactic constituents for subcategorization, namely subject and complements, are directly accommodated in four parallel features [SUBJ], [COMP0_LEFT], [COMP1_RIGHT] and [COMP2_RIGHT].

The feasibility of the keyword approach proposed here has been tested during the implementation of CPSG95 in representing a variety of structures.  Particular attention has been given to the constructions or patterns related to Chinese subcategorization.  They include various transitive structures, di-transitive structures, pivotal construction (jianyu-shi), ba-construction (ba-zi ju), various passive constructions (bei-dong shi), etc.  It is found to be easy to  accommodate all these structures in the defined framework consisting of the four features.

We give a couple of typical examples below, in addition to the ones in (3-3) and (3-4) formulated before, to show how various subcategorization phenomena are accommodated in the CPSG95 lexicon within the defined feature structures for subcategorization.  The expected structure and example are shown before each sample formulation in (3‑6) through (3-8) (with irrelevant implementation details left out).



Based on such lexical information, the desirable hierarchical structure on the related syntactic elements, e.g. [S [V O]] instead of [[S V] O], can be imposed via the configurational constraint based on the design of the expectation type.  This is presented in section 3.2.4 below.

3.2.4. Configurational Constraint

The means for the configurational constraint to impose a desirable hierarchical morpho-syntactic structure defined by a grammar is the key to the success of a keyword approach to structural constituents, including subject and complements from the subcategorization.  This section defines the sort hierarchy of the expectation type [expected].  The use of this design for flexible configurational constraint both in the general grammar and in the lexicon will be demonstrated.

As presented before, whether a sign has structural expectation, and what type of expectation a sign has, can be lexically decided:  they form the basis for a lexicalized grammar.  Four basic cases for  expectation are distinguished in the expectation type of CPSG95:  (i) obligatory: the expected sign must occur;  (ii) optional:  the expected sign may occur;  (iii) null:  no expectation;  (iv) satisfied: the expected sign has occurred.  Note that case (i), case (ii) and case (iii) are static information while (iv) is dynamic information, updated at the time when the daughters are combined into a mother sign.  In other words, case (iv) is only possible when the expected structure has actually been built.  In HPSG-style grammars, only the general grammar, i.e. the set of PS rules, has the power of building structures.  For each structure being built, the general grammar will set [satisfied] to the corresponding expectation feature of the mother sign.

Out of the four types, case (i) and case (ii) form a natural class,  named as [a_expected];  case (iii) and case (iv) are of one class named as [saturated].  The formal definition of the type [expected] is given (3-9].

(3-9.) Definition: sorted hierarchy for [expected]

expected: {a_expected, saturated}
a_expected: {obligatory, optional}
ROLE role
SIGN a_sign
saturated: {null, satisfied}

The type [a_expected] introduces two features:  [ROLE] and [SIGN].   [ROLE] specifies the semantic role which the expected sign plays in the structure.  [SIGN] houses various types of constraints on the expected sign.

The type [expected] is designed to meet the requirement of the configurational constraint.  For example, in order to guarantee that syntactic structures for an expecting sign are built on top of its morphological structures if the sign has obligatory morphological expectation, the following configurational constraint is enforced in the general grammar.  (The notation | is used for logical OR.)

(3-10.)         configurational constraint in syntactic PS rules

PREFIXING                    saturated | optional
SUFFIXING                    saturated | optional

The constraint [saturated] means that syntactic rules are permitted to apply if a sign has no morphological expectation or after the morphological expectation has been satisfied.  The reason why the case [optional] does not block the application of syntactic rules is the following.  Optional expectation entails that the expected sign may or may not appear.  It does not have to be satisfied.

Similarly, within syntax, the constraints can be specified in the Subject PS Rule:

(3-11.)         configurational constraint in Subject PS rule

COMP0_LEFT                 saturated | optional
COMP1_RIGHT              saturated | optional
COMP2                           saturated | optional

This ensures that complement rules apply before the subject rule does.  This way of imposing a hierarchical structure between subcategorized elements corresponds to the use of SUBCAT Principle in HPSG based on the notion of obliqueness.

The configurational constraint is also used in CPSG95 for the formal definition of phrase, as formulated below.

phrase macro

PREFIXING saturated | optional
SUFFIXING saturated | optional
COMP1_LEFT saturated | optional
COMP1_RIGHT saturated | optional
COMP2 saturated | optional

Despite the notational difference, this definition follows the spirit reflected in the phrase definition given in Pollard and Sag (1987:69) in terms of the saturation status of the subcategorized complements.  In essence, the above definition says that a phrase is a sign whose morphological expectation and syntactic complement expectation (except for subject) are both saturated.  The reason to include [optional] in the definition is to cover phrases whose head daughter has optional expectation, for example, a verb phrase consisting of just a verb with its optional object omitted in the text.

Together with the design of the structural feature [STRUCT] (section 3.3), the sort hierarchy of the type [expected] will also enable the formal definition for the representation of the fundamental notion word (see Section 4.3 in Chapter IV).  Definitions such as @word and @phrase are the basis for lexical configurational constraints to be imposed on the expected signs when required.  For example, -xing (-ness) will expect an adjective stem with the word constraint and -zhe (-er) can impose the phrase constraint on the expected verb sign based on the analysis proposed in section 6.5.

3.3. Structural Feature Structure

The design of the feature [STRUCT] serves important structural purposes in the formalization of the CPSG95 interface between morphology and syntax.  It is necessary to present the rationale of this design and the sort hierarchy of the type [struct] used in this feature.

The design of [STRUCT struct] originates from the binary structural feature structure [LEX + | -] in the original HPSG theory (Pollard and Sag 1987).  However, in the CPSG95 definition, the type [struct] forms an elaborate sort hierarchy.   It is divided into two types at the top level:  [syn_dtr] and [no_syn_dtr].  A sub-type of [no_syn_dtr] is [no_dtr].  The CPSG95 lexicon encodes the feature [STRUCT no_dtr] for all single morphemes.[12]  Another sub-type of [no_syn_dtr] is [affix] (for units formed via affixation) which is further sub-typed into [prefix] and [suffix], assigned by the Prefix PS rule and Suffix PS Rule.  In syntax, [syn_dtr] includes sub-types like [subj], [comp] and [mod].  Despite the hierarchical depth of the type, it is organized to follow the natural classification of the structural relation involved.  The formal definition is given below.

(3-12.)         Definition: sorted hierarchy for [struct]

struct: {syn_dtr, no_syn_dtr}
syn_dtr: {subj, comp, mod}
comp: {comp0_left, comp1_right, comp2_right}
mod: {mod_left, mod_right}
no_syn_dtr: {no_dtr, affix}
affix: {prefix, suffix}

In CPSG95, [STRUCT] is not a (head) feature which percolates up to the mother sign;  its value is solely decided by the structure being built.[13]   Each PS rule, whether syntactic or morphological, assigns the value of the [STRUCT] feature for the mother sign, according to the nature of combination.  When morpheme daughters are combined into a mother sign word, the value of the feature [STRUCT] for the mother sign remains a sub-type of [no_syn_dtr].  But when some syntactic rules are applied, the rules will assign the value to the mother sign as a sub-type of [syn_dtr] to show that the structure being built is a syntactic construction.

The design of the feature structure [STRUCT struct] is motivated by the new requirement caused by introducing morphology into the general grammar of  CPSG95.  In HPSG, a simple, binary type for [LEX] is sufficient to distinguish lexical signs, i.e. [LEX +], from signs created via syntactic rules, i.e. [LEX -].  But in CPSG95, as presented in section 3.2.1 before, productive derivation is also accommodated in the general grammar.  A simple distinction between a lexical sign and a syntactic sign cannot capture the difference between signs created via morphological rules and signs created via syntactic rules.  This difference plays an essential role in formalizing the morpho-syntactic interface, as shown below.

The following examples demonstrate the structural representation through the design of the feature [STRUCT].  In the CPSG95 lexicon, the single Chinese characters like the prefix ke- (-able) and the free morphemes du (read), bao (newspaper) are all coded as [STRUCT no_dtr].   When the Prefix PS Rule combines the prefix ke- and the verb du into an adjective ke-du, the rule assigns [STRUCT prefix] to the newly built derivative.  The structure may remain in the domain of morphology as the value [prefix] is a sub-type of [no_syn_dtr].  However, when this structure is further combined with a subject, say, bao (newspaper) by the syntactic Subj PS Rule, the resulting structure [bao [ke-du]] (‘Newspapers are readable’) is syntactic, having [STRUCT subj] assigned by the Subj PS Rule;  in fact, this is a simple sentence.  Similarly, the syntactic Comp1_right PS Rules can combine the transitive verb du (read) and the object bao (newspaper) and assign for the unit du bao (read newspapers) in the feature [STRUCT comp1_right].  In general, when signs whose [STRUCT] value is a sub-type of [no_syn_dtr] combine into a unit whose [STRUCT] is assigned a sub-type of [syn_dtr], it marks the jump from the domain of morphology to syntax.  This is the way the interface of Chinese morphology and syntax is formalized in the present formalism.

The use of this feature structure in the definition of Chinese word will be presented in Chapter IV.  Further advantages and flexibility of the design of this structural feature structure and the expectation feature structures will be demonstrated in later chapters in presenting solutions to some long-standing problems at the morpho-syntactic interface.

3.4. Summary

The major design issues for the proposed mono-stratal Chinese grammar CPSG95 are addressed.  This provides a framework and means for formalizing the analysis of the linguistic problems at the morpho-syntactic interface.  It has been shown that the design of the CPSG95 expectation structures enables configuration constraints to be imposed on the structure hierarchy defined by the grammar.  This makes the keyword approach to Chinese subcategorization a feasible alternative to the list design based on the obliqueness hierarchy of subject and complements.

Within this defined framework of CPSG95, the subsequent Chapter IV will be able to formulate the system-internal, but strictly formalized definition of Chinese word.  Formal definitions such as @word and @phrase enable proper configurational constraints to be imposed on the expected signs when required.  This lays a foundation for implementing the proposed solutions to the morpho-syntactic interface problems to be explored in the remaining chapters.



[1] More precisely, it is not ‘word’ order, it is constituent order, or linear precedence (LP) constraint between constituents.

[2]  L. Li (1986, 1990)’s definition on structural constituents does not involve word order.  However, his proposed definition is not an operational one from the angle of natural language processing.  He relies on the decoding of the semantic roles for the definitions of the proposed constituents like NP-agent (ming-shi), NP-patient (ming-shou), etc.  Nevertheless, his proposal has been reported to produce good results in the field of Chinese language teaching.  This seems to be understandable because the process of decoding semantic roles is naturally and subconsciously conducted in the mind of the language instructors/learners.

[3] Most linguists agree that Chinese has no inflectional morphology (e.g. Hockett 1958; Li and Thompson 1981; Zwicky 1987; Sun and Cole 1991).  The few linguists who believe that Chinese has developed or is developing inflection morphology include  Bauer (1988) and Dai (1993).  Typical examples cited as Chinese inflection morphemes are aspect markers le, zhe, guo and the plural marker men.

[4] A note for the notation: uppercase is used for feature and lowercase, for type.

[5] Phonology and discourse are not yet included in the definition.  The latter is a complicated area which requires further research before it can be properly integrated in the grammar analysis.  The former is not necessary because the object for CPSG95 is Written Chinese.   In the few cases where phonology affects structural analysis, e.g. some structural expectation needs to check the match of number of syllables, one can place such a constraint indirectly by checking the number of Chinese characters instead (as we know, a syllable roughly corresponds to a Chinese character or hanzi).

[6] The macro constraint @np in (3-2) is defined to be [CATEGORY n] and a call to another macro constraint @phrase to be defined shortly in Section 3.2.4.

[7] These expectation features defined for [a_sign] are a maximum set of possible expected daughters;  any specific sign may only activate a subset of them, represented by non-null value.

[8] This is similar to viewing morphology as ‘the syntax of words’ (Selkirk 1982; Lieber 1992; Krieger 1994).  It seems that at least affixation shares with syntax similar structural constraints on constituency and linear ordering in Chinese.  The same type of mechanisms (PS rules, typed feature structure for expectation, etc) can be used to capture both Chinese affixation and syntax (see Chapter VI).

[9] More precisely, the decoding of possible ways of semantic composition is guaranteed.  Syntactically ambiguous structures with the same constraints correspond to multiple ways of semantic compositionality.  These are expressed as different entries in the lexicon and the link between these entries is via corresponding lexical rules, following the HPSG practice. (W. Li 1996)

[10]  Borsley (1987) has proposed an HPSG framework where subject is posited as a distinct feature than other complements.  Pollard and Sag (1994:345) point out that “the overwhelming weight of evidence favors Borsley’s view of this matter”.

[11] The only possible benefit of such arrangement is that one can continue using the SUBCAT Principle for building complement structure via list cancellation.

[12] It also includes idioms whose internal morphological structure is unknown or has no grammatical relevance.

[13] The reader might have noticed that the assigned value is the same as the name of the PS rule which applies.  This is because there is correspondence between what type of structure is being built and what PS rule is building it.  Thus, the [STRUCT] feature actually records the rule application information.  For example, [STRUCT subj] reflects the fact that the Subj PS Rule is the most recently applied rule to the structure in point;  a structure built via the Prefix PS Rule has [STRUCT prefix] in place; etc.  This practice gives an extra benefit of the functionality of ‘tracing’ which rules have been applied in the process of debugging the grammar.  If there has never been a rule applied to a sign, it must be a morpheme carrying [STRUCT no_dtr] from the lexicon.



PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

PhD Thesis: Chapter I Introduction

PhD Thesis: Chapter II Role of Grammar

PhD Thesis: Chapter III Design of CPSG95

PhD Thesis: Chapter IV Defining the Chinese Word

PhD Thesis: Chapter V Chinese Separable Verbs

PhD Thesis: Chapter VI Morpho-syntactic Interface Involving Derivation

PhD Thesis: Chapter VII Concluding Remarks

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP



PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)


The Morpho-syntactic Interface in a Chinese Phrase Structure Grammar



Wei Li

B.A., Anqing Normal College, China, 1982

M.A., The Graduate School of Chinese Academy of

Social Sciences, China, 1986



Thesis submitted in partial fulfillment of

the requirements for the degree of



in the Department



Morpho-syntactic Interface in a Chinese Phrase Structure Grammar


Wei Li 2000


November 2000



All rights reserved. This work may not be reproduced in whole or in part, by photocopy or other means, without permission of the author.



Name:                         Wei Li

Degree:                       Ph.D.

Title of thesis:             THE MORPHO-SYNTACTIC INTERFACE IN



(Approved January 12, 2001)



This dissertation examines issues related to the morpho-syntactic interface in Chinese, specifically those issues related to the following long-standing problems in Chinese Natural Language Processing (NLP): (i) disambiguation in Chinese word identification;  (ii) Chinese productive word formation;  (iii) borderline phenomena between morphology and syntax, such as Chinese separable verbs and ‘quasi-affixation’.

All these problems pose challenges to an independent Chinese morphology system or separate word segmenter.  It is argued that there is a need to bring in the syntactic analysis in handling these problems.

To enable syntactic analysis in addition to morphological analysis in an integrated system, it is necessary to develop a Chinese grammar that is capable of representing sufficient information from both morphology and syntax.  The dissertation presents the design of such a Chinese phrase structure grammar, named CPSG95 (for Chinese Phrase Structure Grammar).  The unique feature of CPSG95 is its incorporation of Chinese morphology in the framework of Head-Driven Phrase Structure Grammar.  The interface between morphology and syntax is then defined system internally in CPSG95 and uniformly represented using the underlying grammar formalism used by the Attribute Logic Engine.  For each problem, arguments are presented for the proposed analysis to capture the linguistic generality;  morphological or syntactic solutions are formulated based on the analysis.  This provides a secure approach to solving problems at the interface of Chinese morphology and syntax.


To my daughter Tian Tian

whose babbling accompanied and inspired the writing of this work

And to my most devoted friend Dr. Jianjun Wang

whose help and advice encouraged me to complete this work


First and foremost, I feel profoundly grateful to Dr. Paul McFetridge, my senior supervisor.  It was his support that brought me to SFU and the beautiful city Vancouver, which changed my life.  Over the years,  he introduced me into the HPSG study, and provided me with his own parser for testing grammar writing.  His mentorship and guidance have influenced my research fundamentally.  He critiqued my research experiments and thesis writing in many facets, from the development of key ideas, selection of topics, methodology, implementation details to writing and presentation style.  I feel guilty for not being able to promptly understand and follow his guidance at times.

I would like to thank Dr. Fred Popowich, my second advisor.  He has given me both general academic guidance on research methodology and numerous specific comments for the thesis revision which have helped shape the present version of the thesis as it is today.

I am also grateful to Dr. Nancy Hedberg from whom I have taken four graduate courses, including the course of HPSG.  I have not only learned a lot from her lectures in the classroom, but have benefited greatly from our numerous discussions on general linguistic topics as well as issues in Chinese linguistics.

Thanks to Davide Turkato, my friend and colleague in the Natural Language Lab.  He is always there whenever I need help.  We have also shared many happy hours in our common circle of Esperanto club in Vancouver.

I would like to thank Dr. Ping Xue, Dr. Zita McRobbie, Dr. Thomas Perry, Dr. Donna Gerdts and Dr. Richard DeArmond for the courses I have taken from them.  These courses were an important part of my linguistic training at SFU.

For various help and encouragement I have got during my graduate studies, I should also thank all the faculty, staff and colleagues of the linguistics department and the Natural Language Lab of SFU, in particular, Rita, Sheilagh, Dr. Ross Saunders, Dr. Wyn Roberts, Dr. Murray Munro and Dr. Olivier Laurens.  I am particularly thankful to Carol Jackson, our Graduate Secretary for her years of help.  She is remarkable, very caring and responsive.

I would like to extend my thanks to all my fellow students and friends in the linguistics department of SFU, in particular, Dr. Trude Heift, Dr. Janine Toole, Susan Russel, Dr. Baoning Fu, Zhongying Lu, Dr. Shuicai Zhou, Jianyi Yu, Jean Wang, Cliff Burgess and Kyoung-Ja Lee.  We have had so much fun together and have had many interesting discussions, both academic and non-academic.  Today, most of us have graduated, some are professors or professionals in different universities or institutions.  Our linguistics department is not big, but it is such a nice department where faculty, staff and the graduate student body form a very sociable community.  I have truly enjoyed my graduate life in this department.

Beyond SFU, I would like to thank Dr. De-Kang Lin for the insightful discussion on the possibilities of integrated Chinese parsing back in 1995.  Thanks to Gerald Penn, one of the authors of ALE, for providing the powerful tool ALE and for giving me instructions on modifying some functions in ALE to accommodate some needs for Chinese parsing during my experiment in implementing a Chinese grammar.

I am also grateful to Dr. Rohini Srihari, my current industry supervisor, for giving me an opportunity to manage NLP projects for real world applications at Cymfony.  This industrial experience has helped me to broaden my NLP knowledge, especially in the area of statistical NLP and the area of shallow parsing using Finite State Transducers.

Thanks to Carrie Pine and Walter Gadz from US Air Force Research Laboratory who have been project managers for the Small Business Innovation Research (SBIR) efforts ‘A Domain Independent Event Extraction Toolkit’ (Phase II), ‘Flexible Information Extraction Learning Algorithm’ (Phase I and Phase II) and ‘Intermediate-Level Event Extraction for Temporal and Spatial Analysis and Visualization’ (Phase I and Phase II).  I have been Principal Investigator for these government funded efforts at Cymfony Inc. and have had frequent and extremely beneficial contact with them.  With these projects, I have had an opportunity to apply the skills and knowledge I have acquired from my Ph.D. program at SFU.

My professional training at SFU was made possible by a grant that Dr. Paul McFetridge and Dr. Nick Cercone applied for.  The work reported in this thesis was supported in the later stage  by a Science Council of B.C. (CANADA) G.R.E.A.T. award.  I am grateful to both my academic advisor Paul McFetridge and my industry advisor John Grayson, CEO of TCC Communications Corporation of Victoria, for assisting me in obtaining this prestigious grant.

I would not have been able to start and continue my research career without many previous helps I got from various sources, agencies and people in the last 15 years, for which I owe a big prayer of thanks.

I owe a great deal to Prof. Zhuo Liu and Prof. Yongquan Liu for leading me into the NLP area and supervising my master program in computational linguistics at CASS (Chinese Academy of Social Sciences, 1983-1986).  Their guidance in both research ideas and implementation details benefited me for life.  I am grateful to my former colleagues Prof. Aiping Fu, Prof. Zhenghui Xiong and Prof. Linding Li at the Institute of Linguistics of CASS for many insightful discussions on issues involving NLP and Chinese grammars.  Thanks also go to Ms. Fang Yang and the machine translation team at Gaoli Software Co. in Beijing for the very constructive and fruitful collaborative research and development work.  Our collaboration ultimately resulted in the commercialization of the GLMT English-to-Chinese machine translation system.

Thanks to Dr. Klaus Schubert, Dr. Dan Maxwell and Dr. Victor Sadler from BSO (Utrecht, The Netherlands) for giving me the project of writing a computational grammar of Chinese dependency syntax in 1988.  They gave me a lot of encouragement and guidance in the course of writing the grammar.  This work enabled me to study Chinese grammar in a formal and systematic way.  I have carried over this formal study of Chinese grammar to the work reported in this thesis.

I am also thankful to the Education Ministry of China, Sir Pao Foundation and British Council for providing me with the prestigious Sino-British Friendship Scholarship.  This scholarship enabled me to study computational linguistics at Centre for Computational Linguistics, UMIST, England (1992).  During my stay in UMIST, I had opportunities to attend lectures given by Prof. Jun-ichi Tsujii, Prof. Harold Somers and Dr. Paul Bennett.  I feel grateful to all of them for their guidance in and beyond the classroom.  In particular, I must thank Dr. Paul Bennett for his supervision, help and care.

I would like to thank Prof. Dong Zhen Dong and Dr. Lua Kim Teng for inviting and sponsoring me for a presentation at ICCC’96 in Singapore.  They are the leading researchers in the area of Chinese NLP.  I have benefited greatly from the academic contact and communication with them.

Thanks to anonymous reviewers of the international journals of  Communications of COLIPS, Journal of Chinese Information Processing, World Science and Technology and grkg/Humankybernetik.  Thanks also to reviewers of the International Conference on Chinese Computing (ICCC’96), North American Conference on Chinese Linguistics (NACCL‑9), Applied Natural Language Conference (ANLP’2000), Text Retrieval Conference (TREC-8), Machine Translation SUMMIT II, Conference of the Pacific Association for Computational Linguistics (PACLING-II) and North West Linguistics Conferences (NWLC).  These journals and conferences have provided a forum for publishing the NLP-related research work I and my colleagues have undertaken at different times of my research career.

Thanks to Dr. Jin Guo who has developed his influential theory of tokenization.  I have benefited enormously from exchanging ideas with him on tokenization and Chinese NLP.

In terms of research methodology and personal advice, I owe a great deal to my most devoted friend Dr. Jianjun Wang, Associate Professor at California State University, Bakersfield, and Fellow of the National Center for Education Statistics in US.  Although in a totally different discipline, there has never been an obstacle for him to understand the basic problem I was facing and to offer me professional advice.  At times when I was puzzled and confused, his guidance often helped me to quickly sort things out.  Without his advice and encouragement, I would not have been able to complete this thesis.

Finally, I wish to thank my family for their support.  All my family members, including my parents, brothers and sisters in China, have been so supportive and understanding.  In particular, my father has been encouraging me all the time.  When I went through hardships  in my pursuit,  he shared the same burden;  when I had some achievement,  he was as happy as I was.

I am especially grateful to my wife, Chunxi.  Without her love, understanding and support, it is impossible for me to complete this thesis.  I wish I had done a better job to have kept her less worried and frustrated.  I should thank my four-year-old daughter, Tian Tian.  I feel sorry for not being able to spend more time with her.  What has supported me all these years is the idea that some day she will understand that as a first-generation immigrant, her dad has managed to overcome various challenges in order to create a better environment for her to grow.


Approval                    ii

Abstract                    iii

Dedication                    iv

Acknowledgments                  v

Chapter I  Introduction                1

1.0. Foreword                1

1.1. Background                2

  • Principle of Maximum Tokenization and Critical Tokenization            2
  • Monotonicity Principle and Task-driven Segmentation            5

1.2. Morpho-syntactic Interface Problems        8

1.2.1. Segmentation ambiguity        8

1.2.2. Productive Word Formation        10

1.2.3. Borderline Cases between Morphology and Syntax              11

1.3. CPSG95:  HPSG-style Chinese Grammar in ALE    13

1.3.1. Background and Overview of CPSG95    14

1.3.2. Illustration              15

1.4. Organization of the Dissertation          16

Chapter II  Role of Grammar              18

2.0. Introduction                18

2.1. Segmentation Ambiguity and Syntax        19

2.1.1. Resolution of Hidden Ambiguity      19

2.1.2. Resolution of Overlapping Ambiguity    24

2.2. Productive Word Formation and Syntax      33

2.3. Borderline Cases and Grammar          37

2.4. Knowledge beyond Syntax            39

2.5. Summary                46

Chapter III  Design of CPSG95              48

3.0. Introduction                48

3.1. Mono-stratal Design of Sign          52

3.2. Expectation Feature Structures          57

3.2.1. Morphological Expectation        58

3.2.2. Syntactic Expectation          59

3.2.3. Chinese Subcategorization        63

3.2.4. Configurational Constraint        67

3.3. Structural Feature Structure          70

3.4. Summary                73

Chapter IV  Defining the Chinese Word          75

4.0. Introduction                75

4.1. Two Notions of Word            78

4.2. Judgment Methods              83

4.3. Formal Representation of Word          88

4.4. Summary                92

Chapter V  Chinese Separable Verbs            93

5.0. Introduction                93

5.1. Verb-object Idioms: V+N I            96

5.2. Verb-object Idioms: V+N II          107

5.3. Verb-modifier Idioms: V+A/V          116

5.4. Summary                122

Chapter VI  Morpho-syntactic Interface Involving Derivation    123

6.0. Introduction                123

6.1. General Approach to Derivation          125

6.2. Prefixation                127

6.3. Suffixation                130

6.4. Quasi-affixes                132

6.5. Suffix zhe (-er)              139

6.6. Summary                151

Chapter VII  Concluding Remarks            152

7.0. Summary                152

7.1. Contributions              154

7.2. Limitation                158

7.3. Final Notes                    159

BIBLIOGRAPHY                  161

APPENDIX I    Source Code of Implemented CPSG95      170

APPENDIX II  Source Code of Implemented CPSG95 Lexicon    208

APPENDIX III  Tested Results in Three Experiments Using CPSG95  229



PhD Thesis: Morpho-syntactic Interface in CPSG (cover page)

PhD Thesis: Chapter I Introduction

PhD Thesis: Chapter II Role of Grammar

PhD Thesis: Chapter III Design of CPSG95

PhD Thesis: Chapter IV Defining the Chinese Word

PhD Thesis: Chapter V Chinese Separable Verbs

PhD Thesis: Chapter VI Morpho-syntactic Interface Involving Derivation

PhD Thesis: Chapter VII Concluding Remarks

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP

The mainstream sentiment approach simply breaks in front of social media

I have articulated this point in various previous posts or blogs before, but the world is so dominated by the mainstream that it does not seem to carry.  So let me make it simple to be understood:

The sentiment classification approach based on bag of words (BOW) model, so far the dominant approach in the mainstream for sentiment analysis, simply breaks in front of social media.  The major reason is simple: the social media posts are full of short messages which do not have the “keyword density” required by a classifier to make the proper sentiment decision.   Larger training sets cannot help this fundamental defect of the methodology.  The precision ceiling for this line of work in real-life social media is found to be 60%, far behind the widely acknowledged precision minimum 80% for a usable extraction system.  Trusting a machine learning classifier to perform social media sentiment mining is not much better than flipping a coin.

So let us get straight.  From now on, whoever claims the use of machine learning for social media mining of public opinions and sentiments is likely to be a trap (unless it is verified to have involved parsing of linguistic structures or patterns, which so far has never been heard of in practical systems based on machine learning).  Fancy visualizations may make the results of the mainstream approach look real and attractive but they are just not trustable at all.

Related Posts:

Why deep parsing rules instead of deep learning model for sentiment analysis?
Pros and Cons of Two Approaches: Machine Learning and Grammar Engineering
Coarse-grained vs. fine-grained sentiment analysis

Why deep parsing rules instead of deep learning model for sentiment analysis?


(1)    Learning does not work in short messages as short messages do not have enough data points (or keyword density) to support the statistical model trained by machine learning.  Social media is dominated by short messages.

(2)    With long messages, learning can do a fairly good job in coarse-grained sentiment classification of thumbs-up and thumbs-down, but it is not good at decoding the fine-grained sentiment analysis to answer why people like or dislike a topic or brand.  Such fine-grained insights are much more actionable and valuable than the simple classification of thumbs-up and thumbs-down.

We have experimented with and compared  both approaches to validate the above conclusions.  That is why we use deep parsing rules instead of a deep learning model to reach the industry-leading data quality we have for sentiment analysis.

We do use deep learning for other tasks such as logo and image processing.  But for sentiment analysis and information extraction from text, especially in processing social media, the deep parsing approach is a clear leader in data quality.



The mainstream sentiment approach simply breaks in front of social media

Coarse-grained vs. fine-grained sentiment analysis

Deep parsing is the key to natural language understanding 

Automated survey based on social media

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP


S. Bai: Natural Language Caterpillar Breaks through Chomsky’s Castle


Translator’s note:

This article written in Chinese by Prof. S. Bai is a wonderful piece of writing worthy of recommendation for all natural language scholars.  Prof. Bai’s critical study of Chomsky’s formal language theory with regards to natural language has reached a depth never seen before ever since Chomsky’s revolution in 50’s last century.  For decades with so many papers published by so many scholars who have studied Chomsky, this novel “caterpillar” theory still stands out and strikes me as an insight that offers a much clearer and deeper explanation for how natural language should be modeled in formalism, based on my decades of natural language parsing study and practice (in our practice, I call the caterpillar FSA++, an extension of regular grammar formalism adequate for multi-level natural language deep parsing).  For example, so many people have been trapped in Chomsky’s recursion theory and made endless futile efforts to attempt a linear or near-linear algorithm to handle the so-called recursive nature of natural language which is practically non-existent (see Chomsky’s Negative Impact).  There used to be heated debates  in computational linguistics on whether natural language is context-free or context-sensitive, or mildly sensitive as some scholars call it.  Such debates mechanically apply Chomsky’s formal language hierarchy to natural languages, trapped in metaphysical academic controversies, far from language facts and data.  In contrast, Prof. Bai’s original “caterpillar” theory presents a novel picture that provides insights in uncovering the true nature of natural languages.

S. Bai: Natural Language Caterpillar Breaks through Chomsky’s Castle

Tags: Chomsky Hierarchy, computational linguistics, Natural Language Processing, linear speed

This is a technology-savvy article, not to be fooled by the title seemingly about a bug story in some VIP’s castle.  If you are neither an NLP professional nor an NLP fan, you can stop here and do not need to continue the journey with me on this topic.

Chomsky’s Castle refers to the famous Chomsky Hierarchy in his formal language theory, built by the father of contemporary linguistics Noam Chomsky more than half a century ago.  According to this theory, the language castle is built with four enclosing walls.  The outmost wall is named Type-0, also called Phrase Structure Grammar, corresponding to a Turing machine.  The second wall is Type-1, or Context-sensitive Grammar (CSG), corresponding to a parsing device called linear bounded automaton  with time complexity known to be NP-complete.  The third wall is Type-2, or Context-free Grammar (CFG), corresponding to a  pushdown automaton, with a time complexity that is polynomial, somewhere between square and cubic in the size of the input sentence for the best asymptotic order measured in the worst case scenario.  The innermost wall is Type-3, or Regular Grammar, corresponding to deterministic finite state automata, with a linear time complexity.  The sketch of the 4-wall Chomsky Castle is illustrated below.

This castle of Chomsky has impacted generations of scholars, mainly along two lines.  The first line of impact can be called “the outward fear syndrome”.  Because the time complexity for the second wall (CSG) is NP-complete, anywhere therein and beyond becomes a Forbidden City before NP=P can be proved.  Thus, the pressure for parsing natural languages has to be all confined to within the third wall (CFG).  Everyone knows the natural language involves some context sensitivity,  but the computing device cannot hold it to be tractable once it is beyond the third wall of CFG.  So it has to be left out.

The second line of impact is called “the inward perfection syndrome”.  Following the initial success of using Type 2 grammar (CFG) comes a severe abuse of recursion.  When the number of recursive layers increases slightly, the acceptability of a sentence soon approximates to almost 0.  For example, “The person that hit Peter is John” looks fine,  but it starts sounding weird to hear “The person that hit Peter that met Tom is John”.  It becomes gibberish with sentences like “The person that hit Peter that met Tom that married Mary is John”.  In fact, the majority resources spent with regards to the parsing efficiency are associated with such abuse of recursion in coping with gibberish-like sentences, rarely seen in real life language.  For natural language processing to be practical,  pursuing the linear speed cannot be over emphasized.  If we reflect on the efficiency of the human language understanding process, the conclusion is certainly about the “linear speed” in accordance with the length of the speech input.  In fact, the abuse of recursion is most likely triggered by the “inward perfection syndrome”, for which we intend to cover every inch of the land within the third wall of CFG, even if it is an area piled up by gibberish or garbage.

In a sense, it can be said that one reason for the statistical approach to take over the rule-based approach for such a long time in the academia of natural language processing is just the combination effect of these two syndromes.  To overcome the effects of these syndromes, many researchers have made all kinds of efforts, to be reviewed below one by one.

Along the line of the outward fear syndrome, evidence against the context-freeness has been found in some constructions in Swiss-German.  Chinese has similar examples in expressing respective correspondence of conjoined items and their descriptions.  For example,   “张三、李四、王五的年龄分别是25岁、32岁、27岁,出生地分别是武汉、成都、苏州” (Zhang San, Li Si, Wang Wu’s age is respectively 25, 32, and 27, they were born respectively in Wuhan, Chengdu, Suzhou” ).  Here, the three named entities constitute a list of nouns.  The number of the conjoined list of entities cannot be predetermined, but although the respective descriptors about this list of nouns also vary in length, the key condition is that they need to correspond to the antecedent list of nouns one by one.  This respective correspondence is something beyond the expression power of the context-free formalism.  It needs to get out of the third wall.

As for overcoming “the inward perfection syndrome”, the pursuit of “linear speed” in the field of NLP has never stopped.  It ranges from allowing for the look-ahead mechanism in LR (k) grammar, to the cascaded finite state automata, to the probabilistic CFG parsers which are trained on a large treebank and eventually converted to an Ngram (n=>5) model.  It should also include RNN/LSTM for its unique pursuit for deep parsing from the statistical school.  All these efforts are striving for defining a subclass in Type-2 CFG that reaches linear speed efficiency yet still with adequate linguistic power.  In fact, all parsers that have survived after fighting the statistical methods are to some degree a result of overcoming “the inward perfection syndrome”, with certain success in linear speed pursuit while respecting linguistic principles.  The resulting restricted subclass, compared to the area within the original third wall CFG, is a greatly “squashed” land.

If we agree that everything in parsing should be based on real life natural language as the starting point and the ultimate landing point, it should be easy to see that the outward limited breakthrough and the inward massive compression should be the two sides of a coin.  We want to strive for a formalism that balances both sides.  In other words, our ideal natural language parsing formalism should look like a linguistic “caterpillar” breaking through the Chomsky walls in his castle, illustrated below:

It seems to me that such a “caterpillar” may have already been found by someone.  It will not take too long before we can confirm it.
Original article in Chinese from 《穿越乔家大院寻找“毛毛虫”
Translated by Dr. Wei Li




On Hand-crafted Myth of Knowledge Bottleneck

In my article “Pride and Prejudice of Main Stream“, the first myth listed as top 10 misconceptions in NLP is as follows:

[Hand-crafted Myth]  Rule-based system faces a knowledge bottleneck of hand-crafted development while a machine learning system involves automatic training (implying no knowledge bottleneck).

While there are numerous misconceptions on the old school of rule systems , this hand-crafted myth can be regarded as the source of all.  Just take a review of NLP papers, no matter what are the language phenomena being discussed, it’s almost cliche to cite a couple of old school work to demonstrate superiority of machine learning algorithms, and the reason for the attack only needs one sentence, to the effect that the hand-crafted rules lead to a system “difficult to develop” (or “difficult to scale up”, “with low efficiency”, “lacking robustness”, etc.), or simply rejecting it like this, “literature [1], [2] and [3] have tried to handle the problem in different aspects, but these systems are all hand-crafted”.  Once labeled with hand-crafting, one does not even need to discuss the effect and quality.  Hand-craft becomes the rule system’s “original sin”, the linguists crafting rules, therefore, become the community’s second-class citizens bearing the sin.

So what is wrong with hand-crafting or coding linguistic rules for computer processing of languages?  NLP development is software engineering.  From software engineering perspective, hand-crafting is programming while machine learning belongs to automatic programming.  Unless we assume that natural language is a special object whose processing can all be handled by systems automatically programmed or learned by machine learning algorithms, it does not make sense to reject or belittle the practice of coding linguistic rules for developing an NLP system.

For consumer products and arts, hand-craft is definitely a positive word: it represents quality or uniqueness and high value, a legit reason for good price. Why does it become a derogatory term in NLP?  The root cause is that in the field of NLP,  almost like some collective hypnosis hit in the community, people are intentionally or unintentionally lead to believe that machine learning is the only correct choice.  In other words, by criticizing, rejecting or disregarding hand-crafted rule systems, the underlying assumption is that machine learning is a panacea, universal and effective, always a preferred approach over the other school.

The fact of life is, in the face of the complexity of natural language, machine learning from data so far only surfaces the tip of an iceberg of the language monster (called low-hanging fruit by Church in K. Church: A Pendulum Swung Too Far), far from reaching the goal of a complete solution to language understanding and applications.  There is no basis to support that machine learning alone can solve all language problems, nor is there any evidence that machine learning necessarily leads to better quality than coding rules by domain specialists (e.g. computational grammarians).  Depending on the nature and depth of the NLP tasks, hand-crafted systems actually have more chances of performing better than machine learning, at least for non-trivial and deep level NLP tasks such as parsing, sentiment analysis and information extraction (we have tried and compared both approaches).  In fact, the only major reason why they are still there, having survived all the rejections from mainstream and still playing a role in industrial practical applications, is the superior data quality, for otherwise they cannot have been justified for industrial investments at all.

the “forgotten” school:  why is it still there? what does it have to offer? The key is the excellent data quality as advantage of a hand-crafted system, not only for precision, but high recall is achievable as well.
quote from On Recall of Grammar Engineering Systems

In the real world, NLP is applied research which eventually must land on the engineering of language applications where the results and quality are evaluated.  As an industry, software engineering has attracted many ingenious coding masters, each and every one of them gets recognized for their coding skills, including algorithm design and implementation expertise, which are hand-crafting by nature.   Have we ever heard of a star engineer gets criticized for his (manual) programming?  With NLP application also as part of software engineering, why should computational linguists coding linguistic rules receive so much criticism while engineers coding other applications get recognized for their hard work?  Is it because the NLP application is simpler than other applications?  On the contrary, many applications of natural language are more complex and difficult than other types of applications (e.g. graphics software, or word processing apps).  The likely reason to explain the different treatment between a general purpose programmer and a linguist knowledge engineer is that the big environment of software engineering does not involve as much prejudice while the small environment of NLP domain  is deeply biased, with belief that the automatic programming of an NLP system by machine learning can replace and outperform manual coding for all language projects.   For software engineering in general, (manual) programming is the norm and no one believes that programmers’ jobs can be replaced by automatic programming in any time foreseeable.  Automatic programming, a concept not rare in science fiction for visions like machines making machines, is currently only a research area, for very restricted low-level functions.  Rather than placing hope on automatic programming, software engineering as an industry has seen a significant progress on work of the development infrastructures, such as development environment and a rich library of functions to support efficient coding and debugging.  Maybe in the future one day, applications can use more and more of automated code to achieve simple modules, but the full automation of constructing any complex software project  is nowhere in sight.  By any standards, natural language parsing and understanding (beyond shallow level tasks such as classification, clustering or tagging)  is a type of complex tasks. Therefore, it is hard to expect machine learning as a manifestation of automatic programming to miraculously replace the manual code for all language applications.  The application value of hand-crafting a rule system will continue to exist and evolve for a long time, disregarded or not.

“Automatic” is a fancy word.  What a beautiful world it would be if all artificial intelligence and natural languages tasks could be accomplished by automatic machine learning from data.  There is, naturally, a high expectation and regard for machine learning breakthrough to help realize this dream of mankind.  All this should encourage machine learning experts to continue to innovate to demonstrate its potential, and should not be a reason for the pride and prejudice against a competitive school or other approaches.

Before we embark on further discussions on the so-called rule system’s knowledge bottleneck defect, it is worth mentioning that the word “automatic” refers to the system development, not to be confused with running the system.  At the application level, whether it is a machine-learned system or a manual system coded by domain programmers (linguists), the system is always run fully automatically, with no human interference.  Although this is an obvious fact for both types of systems, I have seen people get confused so to equate hand-crafted NLP system with manual or semi-automatic applications.

Is hand-crafting rules a knowledge bottleneck for its development?  Yes, there is no denying or a need to deny that.  The bottleneck is reflected in the system development cycle.  But keep in mind that this “bottleneck” is common to all large software engineering projects, it is a resources cost, not only introduced by NLP.  From this perspective, the knowledge bottleneck argument against hand-crafted system cannot really stand, unless it can be proved that machine learning can do all NLP equally well, free of knowledge bottleneck:  it might be not far from truth for some special low-level tasks, e.g. document classification and word clustering, but is definitely misleading or incorrect for NLP in general, a point to be discussed below in details shortly.

Here are the ballpark estimates based on our decades of NLP practice and experiences.  For shallow level NLP tasks (such as Named Entity tagging, Chinese segmentation), a rule approach needs at least three months of one linguist coding and debugging the rules, supported by at least half an engineer for tools support and platform maintenance, in order to come up with a decent system for initial release and running.  As for deep NLP tasks (such as deep parsing, deep sentiments beyond thumbs-up and thumbs-down classification), one should not expect a working engine to be built up without due resources that at least involve one computational linguist coding rules for one year, coupled with half an engineer for platform and tools support and half an engineer for independent QA (quality assurance) support.  Of course, the labor resources requirements vary according to the quality of the developers (especially the linguistic expertise of the knowledge engineers) and how well the infrastructures and development environment support linguistic development.  Also, the above estimates have not included the general costs, as applied to all software applications, e.g. the GUI development at app level and operations in running the developed engines.

Let us present the scene of the modern day rule-based system development.  A hand-crafted NLP rule system is based on compiled computational grammars which are nowadays often architected as an integrated pipeline of different modules from shallow processing up to deep processing.  A grammar is a set of linguistic rules encoded in some formalism, which is the core of a module intended to achieve a defined function in language processing, e.g. a module for shallow parsing may target noun phrase (NP) as its object for identification and chunking.  What happens in grammar engineering is not much different from other software engineering projects.  As knowledge engineer, a computational linguist codes a rule in an NLP-specific language, based on a development corpus.  The development is data-driven, each line of rule code goes through rigid unit tests and then regression tests before it is submitted as part of the updated system for independent QA to test and feedback.  The development is an iterative process and cycle where incremental enhancements on bug reports from QA and/or from the field (customers) serve as a necessary input and step towards better data quality over time.

Depending on the design of the architect, there are all types of information available for the linguist developer to use in crafting a rule’s conditions, e.g. a rule can check any elements of a pattern by enforcing conditions on (i) word or stem itself (i.e. string literal, in cases of capturing, say, idiomatic expressions), and/or (ii) POS (part-of-speech, such as noun, adjective, verb, preposition), (iii) and/or orthography features (e.g. initial upper case, mixed case, token with digits and dots), and/or (iv) morphology features (e.g. tense, aspect, person, number, case, etc. decoded by a previous morphology module), (v) and/or syntactic features (e.g. verb subcategory features such as intransitive, transitive, ditransitive), (vi) and/or lexical semantic features (e.g. human, animal, furniture, food, school, time, location, color, emotion).  There are almost infinite combinations of such conditions that can be enforced in rules’ patterns.  A linguist’s job is to code such conditions to maximize the benefits of capturing the target language phenomena, a balancing art in engineering through a process of trial and error.

Macroscopically speaking, the rule hand-crafting process is in its essence the same as programmers coding an application, only that linguists usually use a different, very high-level NLP-specific language, in a chosen or designed formalism appropriate for modeling natural language and framework on a platform that is geared towards facilitating NLP work.  Hard-coding NLP in a general purpose language like Java is not impossible for prototyping or a toy system.  But as natural language is known to be a complex monster, its processing calls for a special formalism (some form or extension of Chomsky’s formal language types) and an NLP-oriented language to help implement any non-toy systems that scale.  So linguists are trained on the scene of development to be knowledge programmers in hand-crafting linguistic rules.  In terms of different levels of languages used for coding, to an extent, it is similar to the contrast between programmers in old days and the modern software engineers today who use so-called high-level languages like Java or C to code.  Decades ago, programmers had to use assembly or machine language to code a function.  The process and workflow for hand-crafting linguistic rules are just like any software engineers in their daily coding practice, except that the language designed for linguists is so high-level that linguistic developers can concentrate on linguistic challenges without having to worry about low-level technical details of memory allocation, garbage collection or pure code optimization for efficiency, which are taken care of by the NLP platform itself.  Everything else follows software development norms to ensure the development stay on track, including unit testing, baselines construction and monitoring, regressions testing, independent QA, code reviews for rules’ quality, etc.  Each level language has its own star engineer who masters the coding skills.  It sounds ridiculous to respect software engineers while belittling linguistic engineers only because the latter are hand-crafting linguistic code as knowledge resources.

The chief architect in this context plays the key role in building a real life robust NLP system that scales.  To deep-parse or process natural language, he/she needs to define and design the formalism and language with the necessary extensions, the related data structures, system architecture with the interaction of different levels of linguistic modules in mind (e.g. morpho-syntactic interface), workflow that integrate all components for internal coordination (including patching and handling interdependency and error propagation) and the external coordination with other modules or sub-systems including machine learning or off-shelf tools when needed or felt beneficial.  He also needs to ensure efficient development environment and to train new linguists into effective linguistic “coders” with engineering sense following software development norms (knowledge engineers are not trained by schools today).  Unlike the mainstream machine learning systems which are by nature robust and scalable,  hand-crafted systems’ robustness and scalability depend largely on the design and deep skills of the architect.  The architect defines the NLP platform with specs for its core engine compiler and runner, plus the debugger in a friendly development environment.  He must also work with product managers to turn their requirements into operational specs for linguistic development, in a process we call semantic grounding to applications from linguistic processing.   The success of a large NLP system based on hand-crafted rules is never a simple accumulation of linguistics resources such as computational lexicons and grammars using a fixed formalism (e.g. CFG) and algorithm (e.g. chart-parsing).  It calls for seasoned language engineering masters as architects for the system design.

Given the scene of practice for NLP development as describe above, it should be clear that the negative sentiment association with “hand-crafting” is unjustifiable and inappropriate.  The only remaining argument against coding rules by hands comes down to the hard work and costs associated with hand-crafted approach, so-called knowledge bottleneck in the rule-based systems.  If things can be learned by a machine without cost, why bother using costly linguistic labor?  Sounds like a reasonable argument until we examine this closely.  First, for this argument to stand, we need proof that machine learning indeed does not incur costs and has no or very little knowledge bottleneck.  Second, for this argument to withstand scrutiny, we should be convinced that machine learning can reach the same or better quality than hand-crafted rule approach.  Unfortunately, neither of these necessarily hold true.  Let us study them one by one.

As is known to all, any non-trivial NLP task is by nature based on linguistic knowledge, irrespective of what form the knowledge is learned or encoded.   Knowledge needs to be formalized in some form to support NLP, and machine learning is by no means immune to this knowledge resources requirement.  In rule-based systems, the knowledge is directly hand-coded by linguists and in case of (supervised) machine learning, knowledge resources take the form of labeled data for the learning algorithm to learn from (indeed, there is so-called unsupervised learning which needs no labeled data and is supposed to learn from raw data, but that is research-oriented and hardly practical for any non-trivial NLP, so we leave it aside for now).  Although the learning process is automatic,  the feature design, the learning algorithm implementation, debugging and fine-tuning are all manual, in addition to the requirement of manual labeling a large training corpus in advance (unless there is an existing labeled corpus available, which is rare; but machine translation is a nice exception as it has the benefit of using existing human translation as labeled aligned corpora for training).  The labeling of data is a very tedious manual job.   Note that the sparse data challenge represents the need of machine learning for a very large labeled corpus.  So it is clear that knowledge bottleneck takes different forms, but it is equally applicable to both approaches.  No machine can learn knowledge without costs, and it is incorrect to regard knowledge bottleneck as only a defect for the rule-based system.

One may argue that rules require expert skilled labor, while the labeling of data only requires high school kids or college students with minimal training.  So to do a fair comparison of the costs associated, we perhaps need to turn to Karl Marx whose “Das Kapital” has some formula to help convert simple labor to complex labor for exchange of equal value: for a given task with the same level of performance quality (assuming machine learning can reach the quality of professional expertise, which is not necessarily true), how much cheap labor needs to be used to label the required amount of training corpus to make it economically an advantage?  Something like that.  This varies from task to task and even from location to location (e.g. different minimal wage laws), of course.   But the key point here is that knowledge bottleneck challenges both approaches and it is not the case believed by many that machine learning learns a system automatically with no or little cost attached.  In fact, things are far more complicated than a simple yes or no in comparing the costs as costs need also to be calculated in a larger context of how many tasks need to be handled and how much underlying knowledge can be shared as reusable resources.  We will leave it to a separate writing for the elaboration of the point that when put into the context of developing multiple NLP applications, the rule-based approach which shares the core engine of parsing demonstrates a significant saving on knowledge costs than machine learning.

Let us step back and, for argument’s sake, accept that coding rules is indeed more costly than machine learning, so what? Like in any other commodities, hand-crafted products may indeed cost more, they also have better quality and value than products out of mass production.  For otherwise a commodity society will leave no room for craftsmen and their products to survive.  This is common sense, which also applies to NLP.  If not for better quality, no investors will fund any teams that can be replaced by machine learning.  What is surprising is that there are so many people, NLP experts included, who believe that machine learning necessarily performs better than hand-crafted systems not only in costs saved but also in quality achieved.  While there are low-level NLP tasks such as speech processing and document classification which are not experts’ forte as we human have much more restricted memory than computers do, deep NLP involves much more linguistic expertise and design than a simple concept of learning from corpora to expect superior data quality.

In summary, the hand-crafted rule defect is largely a misconception circling around wildly in NLP and reinforced by the mainstream, due to incomplete induction or ignorance of the scene of modern day rule development.  It is based on the incorrect assumption that machine learning necessarily handles all NLP tasks with same or better quality but less or no knowledge bottleneck, in comparison with systems based on hand-crafted rules.



Note: This is the author’s own translation, with adaptation, of part of our paper which originally appeared in Chinese in Communications of Chinese Computer Federation (CCCF), Issue 8, 2013



Domain portability myth in natural language processing

Pride and Prejudice of NLP Main Stream

K. Church: A Pendulum Swung Too Far, Linguistics issues in Language Technology, 2011; 6(5)

Wintner 2009. What Science Underlies Natural Language Engineering? Computational Linguistics, Volume 35, Number 4

Pros and Cons of Two Approaches: Machine Learning vs Grammar Engineering

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP

Pride and Prejudice of NLP Main Stream


In the area of Computational Linguistics, there are two basic approaches to natural language processing, the traditional rule system and the mainstream machine learning.  They are complementary and there are pros and cons associated with both.  However, as machine learning is the dominant mainstream philosophy reflected by the overwhelming ratio of papers published in academia, the area seems to be heavily biased against the rule system methodology.  The tremendous success of machine learning as applied to a list of natural language tasks has reinforced the mainstream pride and prejudice in favor of one and against the other.   As a result, there are numerous specious views which are often taken for granted without check, including attacks on the rule system’s defects based on incomplete induction or misconception.  This is not healthy for NLP itself as an applied research area and exerts an inappropriate influence on the young scientists coming to this area.  This is the first piece of a series of writings aimed at educating the public and confronting the prevalent prejudice, focused on the in-depth examination of the so-called hand-crafted defect of the rule system and the associated  knowledge bottleneck issue.

I. introduction

Over 20 years ago, the area of NLP (natural language processing) went through a process of replacing traditional rule-based systems by statistical machine learning as the mainstream in academia.  Put in a larger context of AI (Artificial Intelligence), this represents a classical competition, and their ups and downs, between the rational school and the empirical school (Church 2007 ).  It needs to be noted that the statistical approaches’ dominance in this area has its historical inevitability.   The old school was confined to toy systems or lab for too long without scientific break-through while machine learning started showing impressive results in numerous fronts of NLP in a much larger scale, initially  very low level NLP such as POS (Part-of-Speech) tagging and speech recognition / synthesis, and later on expanded to almost all NLP tasks, including machine translation, search and ranking, spam filtering, document classification, automatic summarization, lexicon acquisition, named entity tagging, relationship extraction, event classification, sentiment analysis.  This dominance has continued to grow till today when the other school is largely “out” from almost all major NLP arenas, journals and top conferences.  New graduates hardly realize its existence.  There is an entire generation gap for such academic training or carrying on the legacy of the old school, with exceptions of very few survivors (including yours truly) in industry because few professors are motivated to teach it at all or even qualified with an in-depth knowledge of this when the funding and publication prospects for the old school are getting more and more impossible.   To many people’s minds today, learning (or deep learning) is NLP, and NLP is learning, that is all.  As for the “last century’s technology” of rule-based systems, it is more like a failure tale from a distant history.

The pride and prejudice of the mainstream were demonstrated the most in the recent incidence when Google announced its deep-learning-based SyntaxNet and proudly claimed it to be “the most accurate parser in the world”, so resolute and no any conditions attached, and without even bothering to check the possible existence of the other school.  This is not healthy (and philosophically unbalanced too) for a broad area challenged by one of the most complex problems of mankind, i.e. to decode natural language understanding.  As there is only one voice heard, it is scaring to observe that the area is packed with prejudice and ignorance with regards to the other school, some from leaders of the area.  Specious comments are rampant and often taken for granted without check.

Prejudice is not a real concern as it is part of the real world around and involving ourselves, something to do with human nature and our innate limitation and ignorance. What is really scary is the degree and popularity of such prejudice represented in numerous misconceptions that can be picked up everywhere in this circle (I am not going to trace the sources of these as they are everywhere and people who are in this area for some time know this is not Quixote’s windmill but a reality reflection).  I will list below some of the myths or fallacies so deeply rooted in the area that they seem to become cliche, or part of the community consensus.  If one or more statements below sound familiar to you and they do not strike you as opinionated or specious which cannot withstand scrutiny, then you might want to give a second study of the issue to make sure we have not been subconsciously brain-washed.  The real damage is to our next generation, the new scholars coming to this field, who often do not get a chance for doubt.

For each such statement to be listed, it is not difficult to cite a poorly designed stereotypical rule system that falls short of the point, but the misconception lies in its generalization of associating an accused defect to the entire family of a school, ignorant of the variety of designs and the progress made in that school.

There are two types of misconceptions, one might be called myth and the other is sheer fallacy.  Myths arise as a result of “incomplete induction”.  Some may have observed or tried some old school rule systems of some sort, which show signs of the stated defect, then they jump to conclusions leading to the myths.  These myths call for in-depth examination and arguments to get the real picture of the truth.  As for fallacies, they are simply untrue.  It is quite a surprise, though, to see that even fallacies seem to be widely accepted as true by many, including some experts in this area.  All we need is to cite facts to prove them wrong.  For example, [Grammaticality Fallacy] says that the rule system can only parse grammatical text and cannot handle degraded text with grammar mistakes in it.  Facts speak louder than words: the sentiment engine we have developed for our main products is a parsing-supportedrule-based system that fully automatically extracts and mines public opinions and consumer insights from all types of social media, typical of degraded text.  Third-party evaluations show that this system is industry leader in data quality of sentiments, significantly better than competitions adopting machine learning. The large-scale operation of our system in the cloud in handling terabytes of real life social media big data (a year of social media in our index involve about 30 billion documents across more than 40 languages) also prove wrong what is stated in [Scalability Fallacy] below.

Let us now list these widely spread rumours collected from the community about the rule-based system to see if they ring the bell before we dive into the first two core myths to uncover the true picture behind in separate blogs.

II.  Top 10 Misconceptions against Rules

[Hand-crafted Myth]  Rule-based system faces a knowledge bottleneck of hand-crafted development while a machine learning system involves automatic training (implying no knowledge bottleneck). [see On Hand-crafted Myth of Knowledge Bottleneck.]

[Domain Portability Myth] The hand-crafted nature of a rule-based system leads to its poor domain portability as rules have to be rebuilt each time we shift to a new domain; but in case of machine learning, since the algorithm and system are universal, domain shift only involves new training data (implying strong domain portability). [see Domain Portability Myth]

[Fragility Myth]  A rule-based system is very fragile and it may break before unseen language data, so it cannot lead to a robust real life application.

[Weight Myth] Since there is no statistical weight associated with the results from a rule-based system, the data quality cannot be trusted with confidence.

[Complexity Myth] As a rule-based system is complex and intertwined, it is easy to get to a standstill, with little hope for further improvement.

[Scalability Fallacy]  The hand-crafted nature of a rule-based system makes it difficult to scale up for real life application; it is largely confined to the lab as a toy.

[Domain Restriction Fallacy]  A rule-based system only works in a narrow domain and it cannot work across domains.

[Grammaticality Fallacy] A rule-based system can only handle grammatical input in the formal text (such as news, manuals, weather broadcasts), it fails in front of degraded text involving misspellings and ungrammaticality such as social media, oral transcript, jargons or OCR output.

[Outdated Fallacy]  A rule-based system is a technology of last century, it is outdated (implying that it no longer works or can result in a quality system in modern days).

[Data Quality Fallacy]  Based on the data quality of results, a machine learning system is better than a rule based system. (cf: On Recall of Grammar Engineering Systems)

III.  Retrospect and Reflection of Mainstream

As mentioned earlier, a long list of misconceptions about the old school of rule-based systems have been around the mainstream for years in the field.   It may sound weird for an interdisciplinary field named Computational Linguistics to drift more and more distant from linguistics;  linguists play less and less a role in NLP dominated by statisticians today.  It seems widely assumed that with advanced deep learning algorithms, once data are available, a quality system will be trained without the need for linguistic design or domain expertise.

Not all main stream scholars are one-sided and near-sighted.  In recent years, insightful articles  (e.g., church 2007, Wintner 2009) began a serious retrospect and reflection process and called for the return of Linguistics: “In essence, linguistics is altogether missing in contemporary natural language engineering research. … I want to call for the return of linguistics to computational linguistics.”(Wintner 2009)Let us hope that their voice will not be completely muffled in this new wave of deep learning heat.

Note that the rule system which the linguists are good at crafting in industry is different from the classical linguistic study, it is formalized modeling of linguistic analysis.  For NLP tasks beyond shallow level, an effective rule system is not a simple accumulation of computational lexicons and grammars, but involves a linguistic processing strategy (or linguistic algorithm) for different levels of linguistic phenomena.  However, this line of study on the NLP platform design, system architecture and formalism has increasingly smaller space for academic discussion and publication, the research funding becomes almost impossible, as a result, the new generation faces the risk of a cut-off legacy, with a full generation of talent gap in academia.  Church (2007) points out that the statistical research is so dominant and one-sided that only one voice is now heard.  He is a visionary main stream scientist, deeply concerned about the imbalance of the two schools in NLP and AI.  He writes:

Part of the reason why we keep making the same mistakes, as Minsky and Papert mentioned above, has to do with teaching. One side of the debate is written out of the textbooks and forgotten, only to be revived/reinvented by the next generation.  ……

To prepare students for what might come after the low hanging fruit has been picked over, it would be good to provide today’s students with a broad education that makes room for many topics in Linguistics such as syntax, morphology, phonology, phonetics, historical linguistics and language universals. We are graduating Computational Linguistics students these days that have very deep knowledge of one particular narrow sub-area (such as machine learning and statistical machine translation) but may not have heard of Greenberg’s Universals, Raising, Equi, quantifier scope, gapping, island constraints and so on. We should make sure that students working on co-reference know about c-command and disjoint reference. When students present a paper at a Computational Linguistics conference, they should be expected to know the standard treatment of the topic in Formal Linguistics.

We ought to teach this debate to the next generation because it is likely that they will have to take Chomsky’s objections more seriously than we have. Our generation has been fortunate to have plenty of low hanging fruit to pick (the facts that can be captured with short ngrams), but the next generation will be less fortunate since most of those facts will have been pretty well picked over before they retire, and therefore, it is likely that they will have to address facts that go beyond the simplest ngram approximations.



About Author

Dr. Wei Li is currently Chief Scientist at Netbase Solutions in the Silicon Valley, leading the effort for the design and development of a multi-lingual sentiment mining system based on deep parsing.  A hands-on computational linguist with 30 years of professional experience in Natural Language Processing (NLP), Dr. Li has a track record of making NLP work robust. He has built three large-scale NLP systems, all transformed into real-life, globally distributed products.


Note: This is the author’s own translation, with adaptation, of our paper in Chinese which originally appeared in W. Li & T. Tang, “Pride and Prejudice of Main Stream:  Rule-based System vs. Machine Learning“, in Communications of Chinese Computer Federation (CCCF), Issue 8, 2013



K. Church: A Pendulum Swung Too Far, Linguistics issues in Language Technology, 2011; 6(5)

Wintner 2009. What Science Underlies Natural Language Engineering? Computational Linguistics, Volume 35, Number 4

Domain portability myth in natural language processing

On Hand-crafted Myth and Knowledge Bottleneck

On Recall of Grammar Engineering Systems

Pros and Cons of Two Approaches: Machine Learning vs Grammar Engineering

It is untrue that Google SyntaxNet is the “world’s most accurate parser”

R. Srihari, W Li, C. Niu, T. Cornell: InfoXtract: A Customizable Intermediate Level Information Extraction Engine. Journal of Natural Language Engineering, 12(4), 1-37, 2006

Introduction of Netbase NLP Core Engine

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP


On Recall of Grammar Engineering Systems

After I showed the benchmarking results of SyntaxNet and our rule system based on grammar engineering, many people seem to be surprised by the fact that the rule system beats the newest deep-learning based parser in data quality.  I then got asked many questions, one question is:

Q: We know that rules crafted by linguists are good at precision, how about recall?

This question is worth a more in-depth discussion and serious answer because it touches the core of the viability of the “forgotten” school:  why is it still there? what does it have to offer? The key is the excellent data quality as advantage of a hand-crafted system, not only for precision, but high recall is achievable as well.

Before we elaborate, here was my quick answer to the above question:

  • Unlike precision, recall is not rules’ forte, but there are ways to enhance recall;
  • To enhance recall without precision compromise, one needs to develop more rules and organize the rules in a hierarchy, and organize grammars in a pipeline, so recall is a function of time;
  • To enhance recall with limited compromise in precision, one can fine-tune the rules to loosen conditions.

Let me address these points by presenting the scene of action for this linguistic art in its engineering craftsmanship.

A rule system is based on compiled computational grammars.  A grammar is a set of linguistic rules encoded in some formalism.  What happens in grammar engineering is not much different from other software engineering projects.  As knowledge engineer, a computational linguist codes a rule in a NLP-specific language, based on a development corpus.  The development is data-driven, each line of rule code goes through rigid unit tests and then regression tests before it is submitted as part of the updated system.  Depending on the design of the architect, there are all types of information available for the linguist developer to use in crafting a rule’s conditions, e.g. a rule can check any elements of a pattern by enforcing conditions on (i) word or stem itself (i.e. string literal, in cases of capturing, say, idiomatic expressions), and/or (ii) POS (part-of-speech, such as noun, adjective, verb, preposition), (iii) and/or orthography features (e.g. initial upper case, mixed case, token with digits and dots), and/or (iv) morphology features (e.g. tense, aspect, person, number, case, etc. decoded by a previous morphology module), (v) and/or syntactic features (e.g. verb subcategory features such as intransitive, transitive, ditransitive), (vi) and/or lexical semantic features (e.g. human, animal, furniture, food, school, time, location, color, emotion).  There are almost infinite combinations of such conditions that can be enforced in rules’ patterns.  A linguist’s job is to use such conditions to maximize the benefits of capturing the target language phenomena, through a process of trial and error.

Given the description of grammar engineering as above, what we expect to see in the initial stage of grammar development is a system precision-oriented by nature.  Each rule developed is geared towards a target linguistic phenomenon based on the data observed in the development corpus: conditions can be as tight as one wants them to be, ensuring precision.  But no single rule or a small set of rules can cover all the phenomena.  So the recall is low in the beginning stage.  Let us push things to extreme, if a rule system is based on only one grammar consisting of only one rule, it is not difficult to quickly develop a system with 100% precision but very poor recall.  But what is good of a system that is precise but without coverage?

So a linguist is trained to generalize.  In fact, most linguists are over-trained in school for theorizing and generalization before they get involved in software industrial development.  In my own experience in training new linguists into knowledge engineers, I often have to de-train this aspect of their education by enforcing strict procedures of data-driven and regression-free development.  As a result, the system will generalize only to the extent allowed to maintain a target precision, say 90% or above.

It is a balancing art.  Experienced linguists are better than new graduates.  Out of  explosive possibilities of conditions, one will only test some most likely combination of conditions based on linguistic knowledge and judgement in order to reach the desired precision with maximized recall of the target phenomena.  For a given rule, it is always possible to increase recall at compromise of precision by dropping some conditions or replacing a strict condition by a loose condition (e.g. checking a feature instead of literal, or checking a general feature such as noun instead of a narrow feature such as human).  When a rule is fine-tuned with proper conditions for the desired balance of precision and recall, the linguist developer moves on to try to come up with another rule to cover more space of the target phenomena.

So, as the development goes on, and more data from the development corpus are brought to the attention on the developer’s radar, more rules are developed to cover more and more phenomena, much like silkworms eating mulberry leaves.  This is incremental enhancement fairly typical of software development cycles for new releases.  Most of the time, newly developed rules will overlap with existing rules, but their logical OR points to an enlarged conquered territory.  It is hard work, but recall gradually, and naturally, picks up with time while maintaining precision until it hits long tail with diminishing returns.

There are two caveats which are worth discussing for people who are curious about this “seasoned” school of grammar engineering.

First, not all rules are equal.  A non-toy rule system often provides mechanism to help organize rules in a hierarchy for better quality as well as easier maintenance: after all, a grammar hard to understand and difficult to maintain has little prospect for debugging and incremental enhancement.  Typically, a grammar has some general rules at the top which serve as default and cover the majority of phenomena well but make mistakes in the exceptions which are not rare in natural language.  As is known to all, naturally language is such a monster that almost no rules are without exceptions.  Remember in high school grammar class, our teacher used to teach us grammar rules.  For example, one rule says that a bare verb cannot be used as predicate with third person singular subject, which should agree with the predicate in person and number by adding -s to the verb: hence, She leaves instead of *She leave.  But soon we found exceptions in sentences like The teacher demanded that she leave.  This exception to the original rule only occurs in object clauses following certain main clause verbs such as demand, theoretically labeled  by linguists as subjunctive mood.  This more restricted rule needs to work with the more general rule to result in a better formulated grammar.

Likewise, in building a computational grammar for automatic parsing or other NLP tasks, we need to handle a spectrum of rules with different degrees of generalizations in achieving good data quality for a balanced precision and recall.  Rather than adding more and more restrictions to make a general rule not to overkill the exceptions, it is more elegant and practical to organize the rules in a hierarchy so the general rules are only applied as default after more specific rules are tried, or equivalently, specific rules are applied to overturn or correct the results of general rules.  Thus, most real life formalisms are equipped with hierarchy mechanism to help linguists develop computational grammars to model the human linguistic capability in language analysis and understanding.

The second point that relates to the topic of recall of a rule system is so significant but often neglected that it cannot be over-emphasized and it calls for a separate writing in itself.  I will only present a concise conclusion here.  It relates to multiple levels of parsing that can significantly enhance recall for both parsing and parsing-supported NLP applications.  In a multi-level rule system, each level is one module of the system, involving a grammar.  Lower levels of grammars help build local structures (e.g. basic Noun Phrase), performing shallow parsing.  System thus designed are not only good for modularized engineering, but also great for recall because shallow parsing shortens the distance of words that hold syntactic relations (including long distance relations) and lower level linguistic constructions clear the way for generalization by high level rules in covering linguistic phenomena.

In summary, a parser based on grammar engineering can reach very high precision and there are proven effective ways of enhancing its recall.  High recall can be achieved if enough time and expertise are invested in its development.  In case of parsing, as shown by test results, our seasoned English parser is good at both precision (96% vs. SyntaxNet 94%) and recall (94% vs. SyntaxNet 95%, only 1 percentage point lower than SyntaxNet) in news genre, and with regards to social media, our parser is robust enough to beat SyntaxNet in both precision (89% vs. SyntaxNet 60%) and recall (72% vs. SyntaxNet 70%).



Is Google SyntaxNet Really the World’s Most Accurate Parser?

It is untrue that Google SyntaxNet is the “world’s most accurate parser”

R. Srihari, W Li, C. Niu, T. Cornell: InfoXtract: A Customizable Intermediate Level Information Extraction Engine. Journal of Natural Language Engineering, 12(4), 1-37, 2006

K. Church: A Pendulum Swung Too Far, Linguistics issues in Language Technology, 2011; 6(5)

Pros and Cons of Two Approaches: Machine Learning vs Grammar Engineering

Pride and Prejudice of NLP Main Stream

On Hand-crafted Myth and Knowledge Bottleneck

Domain portability myth in natural language processing

Introduction of Netbase NLP Core Engine

Overview of Natural Language Processing

Dr. Wei Li’s English Blog on NLP



Small talk: World’s No 0

A few weeks ago, I had a chat with my daughter who’s planning to study cs.
“Dad, how are things going?”
“Got a problem: Google announced SyntaxNet claimed to be world’s no 1.”
“Why a problem?”
“Well if they are no 1, where am I?”
“No 2?”
“No, I don’t know who is no 1, but I have never seen a system beating ours. I might just as well be no 0.”
“Brilliant, I like that! Then stay in no 0, and let others fight for no 1. ……. In my data structure, I always start with 0 any way.”