(translated by Google Translate, post-edited by myself)
For natural language processing (NLP) and its applications, system architecture is the core issue. In my blog post (OVERVIEW OF NATURAL LANGUAGE PROCESSING), I sketched four NLP system architecture diagrams, which I now present one by one.
In my design philosophy, an NLP process is divided into four stages, from the core engine up to the applications, as reflected in the four diagrams. At the bottom is deep parsing, which follows the bottom-up processing of an automatic sentence analyzer. This work is the most difficult, but it is the foundation and enabling technology for the vast majority of NLP systems.
The purpose of parsing is to structure unstructured text. Facing ever-changing language, only when text is structured in some logical form can we formulate patterns for the information we would like to extract to support applications. This principle of linguistic structure began to become the consensus in the linguistics community when Chomsky proposed the transformation from surface structure to deep structure in his linguistic revolution of 1957. A tree representing the logical form not only involves arcs that express syntactic-semantic relationships, but also contains nodes of words or phrases that carry various conceptual information. Despite the importance of such deep trees, they generally do not directly support an NLP product. They remain the internal representation of the parsing system, the result of language analysis and understanding, before they are semantically grounded to the applications as their core support.
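To make the idea of a logical form with arcs and nodes concrete, here is a minimal sketch in Python. It is only an illustration, not the representation used by any particular parser: the class names (Node, Arc, LogicalForm), the relation labels (agent, patient, time), the feature names, and the example sentence are all assumptions made up for this example, and the structure is built by hand rather than produced by a parser.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A word or phrase in the logical form, carrying conceptual features."""
    text: str
    features: set = field(default_factory=set)


@dataclass
class Arc:
    """A labeled syntactic-semantic relation from a head node to a dependent node."""
    head: int        # index into LogicalForm.nodes
    label: str       # hypothetical relation name, e.g. "agent"
    dependent: int   # index into LogicalForm.nodes


@dataclass
class LogicalForm:
    nodes: list
    arcs: list

    def dependents(self, head, label):
        """Return the nodes attached to `head` by an arc with the given label."""
        return [self.nodes[a.dependent]
                for a in self.arcs
                if a.head == head and a.label == label]


# Hand-built logical form for the toy sentence "Mary bought a car in May."
lf = LogicalForm(
    nodes=[
        Node("bought", {"Event", "Purchase"}),
        Node("Mary", {"Person"}),
        Node("car", {"Vehicle"}),
        Node("May", {"Month"}),
    ],
    arcs=[Arc(0, "agent", 1), Arc(0, "patient", 2), Arc(0, "time", 3)],
)

agent_texts = [n.text for n in lf.dependents(0, "agent")]
print(agent_texts)
```

The point of the sketch is that patterns can be written against relation labels and node features instead of against word order.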
The next layer after parsing is the extraction layer, as shown in the above diagram. Its input is the parse tree, and its output is the filled-in content of templates, similar to filling in a form. The template is, so to speak, a pre-defined table of the information needed by the application, and the extraction system fills in the blanks with the related words or phrases extracted from text on the basis of parsing. This layer moves from the domain-independent parser to application-oriented, product-driven tasks.
It is worth emphasizing that the extraction layer is geared towards a domain-oriented semantic focus, while the preceding parsing layer is domain-independent. Therefore, a good framework performs a very thorough analysis of logical semantics in deep parsing in order to reduce the burden of information extraction. With the depth of analysis in logical semantic structures supporting the extraction, one rule at the extraction layer is in essence equivalent to thousands of surface rules at the linear text layer. This creates the conditions for efficient porting to new domains based on the same core parsing engine.
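The template filling described above, and the claim that one rule over logical structure covers many surface variants, can be sketched roughly as follows. This is a toy under stated assumptions: the "parses" are hand-written (head, relation, dependent) triples standing in for deep parser output, the company names Acme and Bolt are invented, and the function name fill_acquisition_template and the template slots are hypothetical.

```python
# Hand-written stand-ins for deep parser output: three different surface
# sentences that are assumed to normalize to the same logical arcs.
LOGICAL_ARCS = [("acquire", "agent", "Acme"), ("acquire", "patient", "Bolt")]
PARSES = {
    "Acme acquired Bolt.": LOGICAL_ARCS,
    "Bolt was acquired by Acme.": LOGICAL_ARCS,
    "Bolt, which Acme has acquired, is growing.": LOGICAL_ARCS,
}


def fill_acquisition_template(arcs):
    """One extraction rule: fill a buyer/target template from 'acquire' arcs.

    Returns None when the arcs do not contain a complete acquisition event.
    """
    template = {"event": None, "buyer": None, "target": None}
    for head, relation, dependent in arcs:
        if head != "acquire":
            continue
        template["event"] = "acquisition"
        if relation == "agent":
            template["buyer"] = dependent
        elif relation == "patient":
            template["target"] = dependent
    if template["buyer"] and template["target"]:
        return template
    return None


filled = {sentence: fill_acquisition_template(arcs)
          for sentence, arcs in PARSES.items()}
for sentence, template in filled.items():
    print(sentence, "->", template)
```

A single rule fills the same template for the active, passive, and relative-clause variants, because the surface differences are assumed to have been absorbed by the parser. A surface-level approach would need a separate pattern for each word order.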
There are two types of extraction. One is traditional information extraction (IE), the extraction of facts or objective information: named entities, the relationships between entities, and events involving entities (which can answer questions like “who did what when and where”). This extraction of objective information is the core technology and foundation for the knowledge graph (nowadays such a hot area in industry). After IE is complete, the next layer, information fusion (IF), is aimed at constructing the knowledge graph. The other type of extraction is about subjective information; public opinion mining, for example, is based on this kind of extraction. My focus over the past five years has been along this line, on fine-grained extraction of public opinions (not just sentiment classification, but also exploring the reasons behind the opinions and sentiments, to provide an insight basis for decision-making). This is one of the hardest tasks in NLP, much more difficult than IE for objective information. Extracted information is usually stored in a database. This provides a huge number of textual mentions of information to feed the mining layer that follows.
Many people confuse information extraction and text mining, but in fact they are tasks at two different levels. Extraction faces each individual language tree, embodied in each sentence, in order to find the information we want. Mining, however, faces a corpus, or the data sources as a whole, gathering statistically significant insights from the language forest. In the information age, the biggest challenge we face is information overload. We have no way to exhaust the information ocean for the insights we need, so we must use the computer to dig out of that ocean the critical intelligence required to support different applications. Mining therefore relies naturally on statistics: without statistics, the information remains scattered across the corpus even if it has been identified. There is a lot of redundancy in the extracted mentions of information, and mining can integrate them into valuable insights.
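The difference between the two levels can be illustrated with a small sketch: extraction yields one mention per sentence, and mining aggregates the redundant mentions into corpus-level counts. Everything here is made up for illustration, including the brand name PhoneX, the (brand, sentiment, reason) record shape, and the function name mine_opinions. Real mining would of course involve much more than counting.

```python
from collections import Counter, defaultdict

# Hypothetical output of the extraction layer: one (brand, sentiment, reason)
# mention per sentence, with plenty of redundancy across the corpus.
mentions = [
    ("PhoneX", "negative", "battery life"),
    ("PhoneX", "negative", "battery life"),
    ("PhoneX", "positive", "camera"),
    ("PhoneX", "negative", "price"),
    ("PhoneX", "positive", "camera"),
    ("PhoneX", "negative", "battery life"),
]


def mine_opinions(mentions):
    """Aggregate per-sentence mentions into corpus-level counts per brand."""
    sentiment_counts = defaultdict(Counter)
    reason_counts = defaultdict(Counter)
    for brand, sentiment, reason in mentions:
        sentiment_counts[brand][sentiment] += 1
        reason_counts[brand][(sentiment, reason)] += 1
    return sentiment_counts, reason_counts


sentiment_counts, reason_counts = mine_opinions(mentions)
top_reason = reason_counts["PhoneX"].most_common(1)[0]
print(dict(sentiment_counts["PhoneX"]))
print(top_reason)
```

No single mention says that battery life is the main complaint. That insight only appears once the scattered mentions are counted together.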
Many NLP systems do not perform deep mining. Instead, they simply use a query to search, in real time, the index of extracted information in the database and merge the retrieved information on the fly, presenting the top n results to the user. This is actually also mining, but it achieves simple mining by way of retrieval in order to support an application directly.
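This retrieval-style mining can be sketched with Python's built-in sqlite3 module. The table layout, the column names, the sample rows, and the function name top_answers are all assumptions for illustration, and an in-memory database stands in for the real store of extracted information.

```python
import sqlite3

# In-memory database standing in for the store of extracted mentions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mentions (entity TEXT, relation TEXT, value TEXT, doc_id INTEGER)"
)
rows = [
    ("Acme", "ceo", "J. Doe", 1),
    ("Acme", "ceo", "J. Doe", 2),
    ("Acme", "ceo", "R. Roe", 3),
    ("Acme", "headquarters", "Springfield", 4),
    ("Acme", "ceo", "J. Doe", 5),
]
conn.executemany("INSERT INTO mentions VALUES (?, ?, ?, ?)", rows)


def top_answers(conn, entity, relation, n=3):
    """Merge redundant mentions at query time and return the top n values.

    Each result is (value, support), where support is the number of
    mentions that agree on that value.
    """
    cursor = conn.execute(
        "SELECT value, COUNT(*) AS support FROM mentions "
        "WHERE entity = ? AND relation = ? "
        "GROUP BY value ORDER BY support DESC, value ASC LIMIT ?",
        (entity, relation, n),
    )
    return cursor.fetchall()


answers = top_answers(conn, "Acme", "ceo", n=2)
print(answers)
```

The merging happens inside the query (GROUP BY and COUNT), so nothing is precomputed: the ranking is produced on the fly for each request.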
There is a lot of work that can be done in the mining layer in order to do a good job of mining. Text mining not only improves the quality of the existing extracted pieces of information; it can also tap hidden information that is not explicitly expressed in the data sources, such as causal relationships between events, or statistical trends in public opinions or behaviours. This type of mining was first done in traditional data mining applications, because traditional mining was aimed at structured data such as transaction records, which makes it easy to mine implicit associations (e.g., people who buy diapers often buy beer; this reflects the common behaviour of young fathers of newborns, and such a hidden association can be mined to optimize the layout and sales of goods). Nowadays, natural language is also structured thanks to deep parsing, so data mining algorithms for hidden intelligence in the database can, in principle, also be applied to enhance the value of intelligence.
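As a rough illustration of mining an implicit association such as the diapers-and-beer example, the sketch below computes lift, a standard association measure: the probability that two items occur together, divided by the product of their individual probabilities. A lift above 1 suggests the items co-occur more often than chance would predict. The basket data is invented for this example, and lift is just one common choice of measure, not necessarily the one used in any given product.

```python
# Invented transaction records; each basket is the set of items in one purchase.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "beer", "bread"},
    {"milk", "bread"},
    {"bread"},
    {"milk"},
    {"diapers", "milk"},
    {"beer", "bread"},
]


def lift(baskets, a, b):
    """Return P(a and b) / (P(a) * P(b)) over the given baskets.

    Returns 0.0 when either item never occurs, to avoid dividing by zero.
    """
    n = len(baskets)
    p_a = sum(a in basket for basket in baskets) / n
    p_b = sum(b in basket for basket in baskets) / n
    p_ab = sum(a in basket and b in basket for basket in baskets) / n
    if p_a == 0 or p_b == 0:
        return 0.0
    return p_ab / (p_a * p_b)


diapers_beer = lift(baskets, "diapers", "beer")
milk_bread = lift(baskets, "milk", "bread")
print(diapers_beer, milk_bread)
```

The same computation could in principle be run over records produced by the extraction layer instead of shopping baskets, for example over sets of events extracted from the same document.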
The fourth architectural diagram is the NLP application layer. In this layer, the results of parsing, extraction, and mining from unstructured text sources can be used to support a variety of NLP products and services, ranging from QA (question answering) systems to the dynamic construction of the knowledge graph (this type of graph is now visualized in Google search when we search for a star or VIP), from automatic polling of public opinions to customer intelligence about brands, and from intelligent assistants (e.g. chatbots, Siri etc.) to automatic summarization and so on.
This is my overall presentation of the basic architecture of NLP and its applications, based on nearly 20 years of experience in industry designing and developing NLP products. About 18 years ago, I presented a similar diagram of the NLP architecture to our first venture investor, who told us that it was a million-dollar slide. The presentation here is a natural inheritance and extension of that diagram.
Here is the previously mentioned million-dollar slide story. Under the Clinton administration, before the turn of the century, the United States went through a “great leap forward” of Internet technology, known as the Dot-Com Bubble, a time of hot money pouring into the IT industry while all kinds of Internet startups were springing up. In that situation, my boss decided to seek venture capital for business expansion, and asked me to illustrate our implemented prototype of the natural language system in order to introduce it. I then drew the following three-tier diagram of an NLP system: the bottom layer is parsing, from shallow to deep; the middle layer is information extraction, built on parsing; and the top layer illustrates some major categories of NLP applications, including QA. Connecting the applications with the two language processing layers below is the database, used to store the results of information extraction, ready at any time to support the applications above. This general architecture has not changed much since I made it years ago, although the details and layout have been redrawn no less than 100 times. The architecture diagram below is one of the first 20 or so versions, mainly covering the architecture of the back-end core engine for information extraction, not so much the front-end flowchart for the interface between applications and the database. I still remember that my boss sent the slide to a Wall Street angel investor early in the morning, and by noon we had his reply, saying that he was very interested. In less than two weeks, we got the first million-dollar angel investment check. Investors labeled it a million-dollar slide, which was believed to show not only the depth of the language technology but also its great potential for practical applications.
Pre-Knowledge Graph: Architecture of Information Extraction Engine
【Related Chinese Blogs】
( translated from http://blog.sciencenet.cn/blog-362400-981742.html )
The generated speech of my fully automatically translated, unedited science blog is attached below (for your entertainment :=). It is amazingly clear and understandable (definitely clearer than if I were giving this lecture myself with my strong accent). If you are an NLP student, you can listen to it as lecture notes from a seasoned NLP practitioner.
Thanks to the newest Google Translate service from Chinese into English at https://translate.google.com/