Author: Bai Shuo
Recently, Amazon's AI product Echo and its voice assistant Alexa have set off a whirlwind in the industry, drawing attention not only from the smart home industry but also from AI start-ups and IT giants. So, what exactly is unique about Alexa?
Some people say that Alexa has solved the challenging "cocktail party" problem in speech recognition: imagine a noisy cocktail party where someone is chatting with you in a low voice, yet you can accurately follow the speech while ignoring the loud noise around you. Alexa is said to model this amazing human capability well, a capability reportedly missing from other leading speech players, including the global speech leader USTC iFLYTEK Co.
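To make the problem concrete: Echo reportedly attacks it in hardware with a multi-microphone far-field array, but a classical software angle is blind source separation. Below is a minimal sketch (my own illustration on synthetic stand-in signals, not Amazon's method) using FastICA from scikit-learn to unmix two "voices" recorded by two microphones.

```python
# Toy cocktail-party demo: two microphones each record a mix of two sources;
# FastICA recovers statistically independent estimates of the originals.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 4000)
voice_a = np.sin(2 * t)                               # stand-in for speaker A
voice_b = np.sign(np.sin(3 * t))                      # stand-in for speaker B
sources = np.c_[voice_a, voice_b]
sources += 0.1 * rng.standard_normal(sources.shape)   # room noise

mixing = np.array([[1.0, 0.5],                        # each mic hears both speakers
                   [0.4, 1.0]])
mic_signals = sources @ mixing.T                      # what the two mics record

ica = FastICA(n_components=2, random_state=0)
separated = ica.fit_transform(mic_signals)            # unmixed voice estimates
print(separated.shape)                                # (4000, 2)
```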
Others say that behind Alexa lies very rich cross-domain know-how: one can ask Alexa for on-demand programs, buy goods and services through it, instruct it to control the various appliances at home, or inquire about all kinds of news. All in all, this is a voice assistant backed by strong services (some resources local, more in the cloud). Apple's Siri and Microsoft's XiaoIce are believed to be no match for Alexa in these comprehensive capabilities.
The excellent performance of the end device, coupled with the huge cloud resources supporting it, accounts for Alexa's success in customer stickiness, and hence its much-touted value as an information portal for the family. That seems a good reason for Alexa's impressive market performance in the US. A considerable number of people seem to see here a huge business opportunity, one that simply cannot be missed. Although Alexa's performance in markets beyond the United States is not as eye-catching, the Alexa whirlwind has still been sweeping the world, generating the industry's greatest buzz and triggering a long list of smart speaker imitations.
Hence the questions: What are the effects of this invention? Who will be affected or even replaced? How should we evaluate Alexa's portal value? And where is this trend going, looking at its yesterday, today and tomorrow?
We may wish to reflect a bit on the development of portals in IT industry history. A "portal" is an entry point or interface for an information network with large data flow, connecting consumers and services. From the model perspective, we have experienced the "web portal" model, the "search engine" model and, more recently, the "social network" model, with the ongoing trend pointing to a portal in the "artificial intelligence" model. From the carrier perspective, the carrier for the "web portal" and "search engine" models is basically the PC, while the carrier for the "social network" model is mainly the smartphone. Does the "artificial intelligence" model have the potential to change the carrier? In other words, is it possible for the Echo-Alexa hardware-software combination, under the banner of artificial intelligence, to win the portal from the smartphone as the chosen point of human-machine interface?
I don't think it is possible. There are three reasons.
First, the scene is wrong. Even if Alexa is powerful, with unique anti-noise ability and the skill of tracking a specific speaker, its location is fixed, which is a huge regression from today's well-developed mobile scenes. Just think about it: the defining feature of a family scene is that two or more individuals are involved. A family is a small society with an innate structure. Who has the right to issue voice commands? Who has the authority to deny or revoke voice commands that others have already issued? What happens if the authoritative person is not at home, or keeps silent? What if a family member wants to issue a private voice instruction? To my mind, a voice instruction as a human-machine interaction vehicle is by nature the behavior of an individual, not of a family, with privacy as a basic need in this setting. The family voice portal scene in which Alexa is now set is therefore likely a contradiction: the more voice commands are parsed and understood, the smaller the proportion of those commands for which the home scene is a necessary condition.
Second, the "horizontal" portal mode faces "vertical" resistance. Even if we agree that "smart home central control" is an entry to end users that no player can afford to miss, smart speakers like Alexa still face challenges from other types of end devices. There are two modes of data flow in the smart home environment: the horizontal mode gathers data across home equipment from different manufacturers, while the vertical mode gathers data from a single manufacturer's own equipment. It follows that the "horizontal" effort is bound to meet "vertical" resistance in a life-and-death struggle. For example, Haier, with its smart refrigerators and other smart home equipment, has no reason to let its valuable data flow away to smart speaker manufacturers.
Third, the same struggle also comes from competing "horizontal" lines of equipment, including home robots, home gateways / intelligent routers, smart TVs, intelligent pendants and so on. The advantage of home robots is that their location need not be fixed; the advantage of the home gateway is that it always stays on; the TV's advantage lies in its big screen; and intelligent pendants (picture frames, sculptures, watches, scales, etc.) have the advantage of being small. In my opinion, smart speakers face all these "horizontal" competitors, and there does not seem to be much of a chance of winning this competition.
In summary, Echo-Alexa's success is the superposition of several factors. It is essentially a success of the Amazon business system, rather than of smart home appliances or voice assistant technology as such. If we ignore the role of its supporting business system, we are likely to overestimate the value of the family information portal, and simply mimicking or following the smart speaker technology leads nowhere. Personally, I feel that the smartphone, as the carrier of the information entry point in the mobile Internet era, still cannot be replaced.
Is the era of voice interaction really coming?
One important reason for the IT giants to take Alexa seriously is that the voice interaction it represents perhaps opens a new paradigm of human-computer interaction. Looking back in history, the rise of the click mode and then the touch mode each triggered a revolutionary paradigm shift in human-computer interaction, directly determining the rise and fall of IT giants. The click mode led to the rise of Wintel; the touch mode enabled Apple to subvert Wintel: we have witnessed these changes with our own eyes. So if voice interaction really represents the next-generation paradigm of human-computer interaction, then Alexa has special meaning as the precursor of that paradigm shift. The giants simply cannot overlook such a shift and its potential revolutionary impact.
However, personally, I do not think that speech interaction alone carries enough weight for an "intergenerational revolution" in human-machine interaction. There are three reasons.
First, speech by itself does not constitute a complete human-computer interaction scene. More than 80% of the information people take in is visual. When speaking, we often rely on some visual information as basic context, using a pronoun to refer to it. For example, pointing to a book on the screen, one may say, "I want to buy this." In other words, a considerable part of the context in which speech is delivered comes from the visual presentation, via gestures, touches or eye movements targeting visual objects. This at least shows that we need multi-modal human-computer interaction, rather than voice alone replacing all other interaction vehicles.
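As a toy sketch of the simplest kind of multi-modal fusion (all names here are hypothetical, my own illustration): the deictic pronoun "this" in a speech transcript is resolved against the user's most recent touch on the screen.

```python
# Resolve "this" in a spoken command using the latest visual selection.
from dataclasses import dataclass
from typing import Optional

@dataclass
class TouchEvent:
    object_id: str                      # the on-screen object the user touched

def resolve_command(transcript: str, last_touch: Optional[TouchEvent]) -> str:
    """Replace a bare 'this' with the touched object's identifier, if any."""
    if last_touch and "this" in transcript.lower().split():
        return transcript.replace("this", last_touch.object_id)
    return transcript

print(resolve_command("I want to buy this", TouchEvent("book_123")))
# -> "I want to buy book_123"
```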
Second, current speech recognition still cannot handle dialects well. China is a big country with a wide variety of dialects. Moreover, people in dialect areas typically speak Mandarin with a strong accent. To benefit the more than half of the population living in dialect areas, speech technology still needs to go through a further stage of development and maturation.
Third, current speech recognition still has difficulty with the "escape" problem, i.e. identifying scenarios in which speech refers to itself. When people find an error in their first utterance and need to correct it, they may use the next sentence to correct the previous one; that new sentence is then not part of the naturally continuous stream of speech commands and hence needs "to be escaped". But it is also possible that the latter sentence should not be escaped: it may simply conjoin with the previous sentence as part of the normal speech stream. This "escape" identification, which distinguishes different levels of speech referents, calls for more advanced semantic analysis technology, which is not yet mature.
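A naive cue-word heuristic, sketched below purely as my own illustration (not any product's method), shows both the task and why it is hard:

```python
# Toy "escape" detector: guess whether an utterance corrects the previous
# command (speech about speech) or continues the normal command stream.
CORRECTION_CUES = ("no,", "i mean", "i meant", "not that", "scratch that")

def is_correction(utterance: str) -> bool:
    """Surface-cue guess that the utterance refers to the previous utterance."""
    return utterance.lower().strip().startswith(CORRECTION_CUES)

commands = []
for utt in ["play some jazz", "no, I meant classical", "and turn up the volume"]:
    if commands and is_correction(utt):
        print(f"escaped (correction of the previous command): {utt!r}")
    else:
        commands.append(utt)
print(commands)   # ['play some jazz', 'and turn up the volume']
```

The failure modes are obvious: "No, thanks" would be wrongly escaped, while a correction lacking a cue word would slip through. That is exactly why real escape identification needs the deeper semantic analysis described above.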
So, considering the current level of speech technology, it seems too early to talk about the "intergenerational revolution". Furthermore, speech may well be just one factor, and not necessarily a disruptive one. It seems more reasonable to state that the future of human-computer interaction may enter an era of multi-modal input, rather than speech alone.
Semantic grounding is the key to user stickiness.
"Semantics" as a term is abused in all kinds of interpretations. Some even think that once the words are identified, the semantics is there; that is far from true. The semantics of natural language runs very deep and involves a lot. I mean a lot!
From the academic point of view, semantics divides into two parts. The first is "symbol grounding", which concerns the relationship between a language symbol (the signifier) and its referent in the real world (including the conceptual world). The second is "role assignment", which concerns the relationships among the referents of language symbols in reality. Siri pioneered mobile semantic grounding, realized in domain apps such as Address Book, Maps and Weather. The past few years have seen the scope of semantic grounding grow wider and wider.
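As a toy illustration (my own, with made-up identifiers): once "I" and "this book" are grounded to concrete referents, role assignment states how those referents relate within the described event.

```python
# Toy parse of "I want to buy this book": grounding fixes the referents,
# role assignment relates them within the buying event.
parse = {
    "predicate": "buy",
    "roles": {
        "agent": "user_42",       # grounded referent of "I"
        "theme": "book_123",      # grounded referent of "this book"
    },
}
print(f'{parse["roles"]["agent"]} -> {parse["predicate"]} -> {parse["roles"]["theme"]}')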
Let me review what I said above: "the excellent performance of the end device, coupled with the huge cloud resources supporting it, accounts for Alexa's success in customer stickiness". We can explore further along this line. Between "the performance of the end device" and "the cloud resources supporting it", which is the root cause of Alexa's stickiness with customers? I do not intend to play the trick of dialectical balance by saying that both are important and neither works without the other. That is always true but cheap, and it yields no actionable insight. Its consequence is blind investment in both by copycats, and such investment may well end in complete market failure.
I would argue that "the performance of the end device" is about the adaptability of the hardware to the scene, which at best delivers a "good user experience". But a product with a good user experience and no real content will soon degrade into a toy, and not even a high-end one. Without real "meaningful services" attached, there is no sustainable user stickiness; without user stickiness, the device cannot serve as a sustainable data collection entry, i.e. a data flow portal. And any attached "meaningful service" must come from semantic grounding, that is, the connection from a speech command to its corresponding actual service. This is the essence behind Alexa's so-called "know-hows". Semantic grounding, as used hereafter, refers to this connection from speech commands to the infinitely varied actual service resources.
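In code terms, semantic grounding in this sense is the step that turns a parsed command into a real service invocation. A minimal sketch follows, with hypothetical intent names and stub services standing in for real resource calls.

```python
# Minimal grounding sketch: a parsed intent is connected to an actual
# service; without this table, recognition stops at words and yields a toy.
def order_taxi(destination: str) -> str:
    return f"taxi booked to {destination}"        # stub for a real service API

def play_program(title: str) -> str:
    return f"now streaming {title}"               # stub for a real service API

GROUNDING = {                                     # speech intent -> service
    "book_ride": order_taxi,
    "play_media": play_program,
}

def ground(intent: str, argument: str) -> str:
    service = GROUNDING.get(intent)
    return service(argument) if service else "no service grounded for this request"

print(ground("book_ride", "the airport"))         # -> taxi booked to the airport
```

The breadth of this table, thousands of real services rather than two stubs, is what the text means by the service resources behind Alexa.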
Comprehensive semantic grounding requires a strong open-domain NLP engine. Service resources number in the tens of thousands and can hardly be confined to one or a few narrow domains; an NLP engine that functions only in a narrow domain cannot do the job. Working in the open domain requires extraordinary capacity in semantic analysis, and it requires being on the right path in semantic knowledge representation and processing. In this regard, even if an English engine does decently well, it does not follow that its Chinese counterpart will. For those who do not yet understand the difficulty and pain points of a Chinese NLP engine in the open domain, it is hardly possible to achieve large-scale semantic grounding. Such technology barriers can open a huge gap in the market between products from companies with and without deep semantic capabilities.
Semantic grounding also requires engineering adaptation at the interface to the service resources. This too is a very difficult task, involving competition in the scale of resources as well as in efficiency and management. Start-up companies can hardly muster such resource-integration capacity and engineering organization; these are the strengths of large companies. Some people say: can I start small and gradually scale up? I say no: time waits for no one. In the area of semantic grounding, if products are not brought to market fast enough to capture it, there is little chance of survival.
Semantic grounding also calls for the ability to manage the man-machine interactive scene itself. This involves a variety of technologies such as contextual awareness, topic switching, sentiment analysis, language style selection, personality shaping and many others. A speech assistant is not necessarily at its best when it merely mimics human eloquence or likeable ways of expression; moderate profundity or sharpness in argument, even occasional rudeness, can all be selling points of an intelligent assistant.
Therefore, I would stress the key role of semantic grounding in the stickiness of Alexa's users, and the decisive contribution of the large pool of service resources behind Alexa's success. In China, unless IT giants with service resources comparable to Amazon's take the lead, coupled with a solid open-domain Chinese NLP engine built by a star team, speech technology alone has no way to generate the kind of user stickiness we see with Alexa.
Who will win then?
In essence, it is all about gathering user data through end devices. Smartphones have dominated the industry for years, and all kinds of smart home solutions across the verticals have been fighting for several years now. Alexa's arrival stirs the industry with excitement and revelations, but the game is far from settled. We still have opportunities. Keep in mind, though, that one cannot overemphasize the questions of how end devices combine with the cloud, and how the entry point combines with its carrier, to form a closed loop of data flow. If we lose our sense of direction on these questions, the opportunity will not be ours.
So what is the direction and what are the trends? Let me give an analysis.
First, artificial intelligence is bound to be the next-generation portal. In other words, all kinds of service demands will inevitably flow from the end devices to the cloud through AI-driven multi-channel input analysis, leveraging its human-computer interaction advantages; and the variety of service resources will in turn draw on AI knowledge and cognitive decision-making to reach users from the cloud back to the end. Without a roadmap for developing artificial intelligence, the future portal is definitely not yours.
Second, for a long time to come the smartphone will remain the de facto chief carrier. Wherever a person goes, the communication node and the digital identity follow, and so do the perception of life scenes and the apps acting as service agents. No other end device matches the smartphone on the dimensions most critical for a portal carrier: individualness, privacy, and ubiquity.
Third, the communication function of a terminal device will separate from the demanded service functions. As services grow more diversified, no single end device can handle all types of service needs; yet it is not desirable for every end device to carry its own communication function. The relationship between the Apple Watch and the iPhone is intriguing in this regard: the iPhone serves as the communication hub and the client-side information processing hub, while the Apple Watch functions as a special device for information collection and limited display, the two connected through a near-field communication link. Of course, both are Apple products in one family, so the data flow is under unified control; the coupling is tight and the separation limited. Still, this mode sheds light on a future in which all kinds of separation may be required, provided the parts remain connected in some way.

If mobile phone manufacturers keep an open mind, they can use blockchain technology to make an objective record of the respective data contributions of a variety of ancillary devices, and accordingly make reasonable arrangements for sharing the data and the proceeds (a toy sketch follows below). A loosely coupled separation will then evolve and mature, promoting the rapid ecological development of end devices in all kinds of forms.

It is imaginable that, arriving in a new place, we could take from a pocket a soft, thin, foldable electronic map. Unfolded, it looks as big as a real paper map, but it works as conveniently as a mobile map app: it responds to touch operations and may even accept speech instructions, in association with our phone. Of course, the map could also be a virtual projection rather than a physical object. The phone would only need to handle communication; all control and display would happen on the map, and we would not even need to take the phone out. Such a phone may never need to be held in the hand; we might even wear it on the foot, the handheld mobile device gradually evolving into a "foot phone"...
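To make the loose-coupling idea concrete, here is a toy hash-chained ledger, purely my own illustration with hypothetical device names: each ancillary device's data contribution is appended as a tamper-evident block, the kind of objective record on which proceeds sharing could later be based.

```python
# Toy tamper-evident ledger: each block records one device's contribution
# and the hash of the previous block, so history cannot be silently edited.
import hashlib
import json
import time

chain = []

def record_contribution(device_id: str, n_records: int) -> None:
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    block = {"device": device_id, "records": n_records,
             "ts": time.time(), "prev": prev_hash}
    block["hash"] = hashlib.sha256(
        json.dumps(block, sort_keys=True).encode()).hexdigest()
    chain.append(block)

record_contribution("foldable_map_01", 120)   # hypothetical ancillary devices
record_contribution("watch_07", 45)
print([(b["device"], b["records"]) for b in chain])
```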
Are you ready for the opportunity and inspirations brought by the Alexa whirlwind?
Translated by: Dr. Wei Li based on GNMT
【Related】
S. Bai: Natural Language Caterpillar Breaks through Chomsky's Castle