[Social Media Mining: Big Data Tells Us Clinton's Campaign Is in Trouble]

This is the comparison chart for the most recent week:

brand-passion-index-15
It indeed looks bad: Uncle Trump has taken the lead. Is this the result of the FBI reopening its investigation?
This is the chart for the past 24 hours:

brand-passion-index-17
This is the one-month comparison of ups and downs:

timeline-comparison-25

By now the situation is basically clear: Clinton's campaign is indeed in trouble. Damn, this election changes by the minute. Not long ago Madam Hillary was leading or the race was deadlocked, and now Uncle Trump has turned the tables. The volatility of this race is truly breathtaking.

This is last week:

timeline-comparison-26

This past week Madam Hillary has been on the back foot: over the past 24 hours she has hovered around -20 while old Trump sits around +10, a gap of some 30 points, dammit:

timeline-comparison-27

For the larger background, here is the three-month comparison:

timeline-comparison-28

So it turns out that Madam Clinton, after trailing all along, only took the lead in late September. Late September to mid-October was Hillary's peak and Trump's trouble period.

As for buzz, that has never changed: Trump always dominates:

timeline-comparison-31

Same for impressions (eyeballs):

timeline-comparison-32

Over the past year, passion intensity has also mostly favored Trump, but Clinton too has plenty of people who strongly adore or strongly hate her, so the curves cross:

timeline-comparison-33

This passion intensity should correlate strongly and positively with so-called engagement: when you are obsessed with or despise a candidate, you are willing to do everything you can to engage, agitate, and fight.
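The claimed link can be made concrete: a positive correlation between two weekly series is just a Pearson coefficient. A minimal sketch with made-up weekly numbers (the real platform data are not reproduced here):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative weekly values, NOT real measurements:
passion_intensity = [62, 70, 75, 81, 88]            # platform intensity score
engagement = [1.1e6, 1.4e6, 1.5e6, 1.9e6, 2.3e6]    # shares + replies + likes

r = pearson(passion_intensity, engagement)
```

If the two metrics really move together as argued, r should sit near +1, as it does for this toy series.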

The best move would be to dig out Uncle Trump's latest scandal right away. After all these years, can this guy really have left no scandal evidence harder and more damning than the audio tape? Common sense tells us there must be a skeleton in the closet, but he is too cunning; perhaps a lifetime as a shrewd businessman means he never even left a stained garment behind? It is time to pull it out of the closet. This election is already as low as it can get, so why not go all the way. Then again, if such evidence existed it would not have waited until today, with only a week left and early voting already under way.

Seen this way, as a data scientist I dare not disregard the data and wishfully proclaim that Madam Hillary's odds are good. It so happens that the month I surveyed a week ago was Clinton's golden month, so those results looked encouraging.

Our big data platform has 27 filters; with our tools one can slice and combine the data in different ways, and in the hands of a skilled analyst this yields polished analysis reports and charts from many angles. Geography and time are just two of them.
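What combinable filters mean in practice can be sketched as AND-composed predicates over post records; the field names and the `apply_filters` helper below are hypothetical illustrations, not our platform's actual schema or API:

```python
# Hypothetical post records; fields are illustrative only.
posts = [
    {"source": "twitter", "geo": "US", "sentiment": "neg", "topic": "clinton"},
    {"source": "news",    "geo": "US", "sentiment": "pos", "topic": "clinton"},
    {"source": "blog",    "geo": "MX", "sentiment": "neg", "topic": "trump"},
    {"source": "twitter", "geo": "US", "sentiment": "pos", "topic": "trump"},
]

def apply_filters(records, **conditions):
    """Keep records matching every field=value condition (AND-combined)."""
    return [r for r in records
            if all(r.get(k) == v for k, v in conditions.items())]

# Two of the many dimensions combined: source AND geography.
us_twitter = apply_filters(posts, source="twitter", geo="US")
```

Each added condition narrows the slice, which is how different filter combinations yield different analytical views of the same data.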

Email-gate is devastating. That the FBI chose to reopen the investigation one week before the election is simply unbelievable, an even sharper piece of timing than the release of Trump's audio tape. That so-called Indian AI company may have placed the right bet, although in the richness and flexibility of its data analysis it is far behind our platform: it basically has only a single engagement metric, without even basic sentiment classification, let alone social media deep sentiments. In any case, Hillary's recent campaign trouble is evident. To what extent this trouble affects actual votes still needs study.

A friend reminded me that so-called social media is actually a blend of pull and push information; its sources include plenty of news and the like. Those top-down posts reflect the talking points of the two parties' propaganda machines: loud and influential, but not the bottom-up likes, dislikes, and voices of ordinary netizens. Ideally we should filter out the former to see true public opinion. In the one-month trend comparison below, we keep only four types of social media, Twitter, Facebook, blogs, and microblogs, and exclude news and the rest:

timeline-comparison-49

Below is Twitter only, much the same:

timeline-comparison-50

Comparing against all social media including news sites, it seems that for this election pull and push are indeed mixed together, without any major conflict or gap between them:

timeline-comparison-51

Why is Hillary's campaign in trouble? Look at her word cloud for the last month: red is starting to outnumber green:

sentiment-drivers-43

sentiment-drivers-44

Compare Trump's word clouds: red and green are roughly even, with green trending upward, especially in the second, emotion-oriented cloud:

sentiment-drivers-45

sentiment-drivers-46

Now compare the clouds for the last week; public opinion and the race are indeed shifting subtly. This is Trump's sentiment cloud for the most recent week:

sentiment-drivers-47

sentiment-drivers-48
Compare Madam Hillary's one-week clouds:

sentiment-drivers-49

sentiment-drivers-50

Below is the cloud of netizens' positive and negative behavioral expressions directed at Hillary:

sentiment-drivers-51

Calls to not vote for Hillary are about even with calls to vote for her. Compare Trump's calls over the last week:

sentiment-drivers-52
Calls to vote for him exceed calls to not vote for him.

These are the most widely circulated posts about Clinton in the last week:

clinton_trouble

The FBI's reopened investigation has clearly been exploited by Trump to the fullest, with far-reaching impact.

Most popular posts last week by engagement:

clinton_trouble1

Most popular posts last week on Clinton by replies and comments:

clinton_trouble2

Some random sample posts:

clinton_tposts_random
Negative comments on Clinton are rampant recently:

clinton_tposts

29367bc4bae054ee9a6262d9cccdfed6

If Hillary loses this time, FBI director Comey deserves much of the credit. Since the audio-tape scandal the race had been extremely favorable to Hillary, and her sharp decline correlates closely with the FBI reopening its investigation. Media attention runs in fevers: however hot a topic, it cools with time and is replaced by another. The problem this time is that the election will be over before the reopened email-gate topic has had time to cool, and it is the topics of the moment that sway voters most. The audio-tape scandal, by contrast, has clearly passed its fermentation and peak, cooled down, and been displaced by the FBI story. From a leaker's perspective the tape came out a bit early, but who could have predicted the FBI would pull such a move at this juncture.

Looking at the last week's #Hashtags also gives a sense of topic heat on social media:

word-cloud-23

Event-related ones include: #fbi #hillarysemails #hillarysemail #podestaemails19 #podestaemails20
Negative ones include: #wikileaks #neverhillary #crookedhillary #votetrump

Look at the buzz around Hillary below: "FBI" is the biggest item in the brands cloud co-mentioned with her in last week's data:

word-cloud-24

The overall buzz last week:

word-cloud-26

This is the emoji cloud for Hillary-related topics in the last week:

hullery1weekemoji

Whatever they say about smiles beating tears, Hillary, her campaign, and her fans cannot laugh: the total emoji count on this topic within one week reached 12,894,243. This is a hallmark of social media, expressing emotion in pictures, and the dominant emotion here is crying. Email-gate has finally exploded.

Now the dilemma: [Big Data Tells Us Clinton's Campaign Is in Trouble], to publish or not to publish? For partisan interest and my anti-Trump stance, I should not: it would boost Trump's morale and deflate our side's. For the professional spirit of a data scientist, I should: starting from data and facts is the foundation of the information age. A compromise is to first publish a rebuttal of that widely circulated piece in which a so-called Indian AI company predicts a Trump win, because its survey interval is basically the same as my earlier one, the best month Hillary ever had, and yet they brazenly predicted a Trump victory based on engagement alone, with no spirit of deep data analysis, just a gamble. Perhaps after debunking the fake AI and promoting real NLU, I can then publish this [Big Data Tells Us Clinton's Campaign Is in Trouble].

The FBI director says this reopened investigation will take a long time to sort out. At present there are merely new leads that require reopening; that says nothing about Hillary's guilt or innocence. Yet stirring up a storm before any conclusion objectively injects uncertainty into the race. Although one should be presumed innocent until proven guilty, people cannot help being influenced by mere rumor, which is why the timing is so critical. If there is some black box behind this reopening, it is all the more chilling. If there is not, and the email-gate explosion at this moment is pure coincidence with the discovery of new leads, then Hillary is simply unlucky, not destined for the crown. A lifetime of strong character, enduring hardship and humiliation, only to fall short at the last step and return empty-handed, possibly even facing jail. One can predict that losing this election would mark the beginning of her rapid aging.

A week ago a journalist interviewed Trump, and Trump kept saying that Hillary, a criminal, should never have been allowed to run. The journalist asked: what crime? Trump said the email-gate leaks, plus deleting emails to hide her wrongdoing. The reopened investigation did not exist yet. The journalist asked: hasn't that case been concluded; do you not trust the FBI's conclusion? Trump said they got it wrong and let a criminal off easily; it is a rotten institution, blah blah. Yet the same organization now receives old Trump's high praise. This is a lawless old fox whose mouth runs wild. Law is a game to him: whatever suits him is right, and whatever does not is corrupt and rigged. How could such a man be trusted with the presidency?

In a see-saw battle like this, swing voters matter enormously, and reportedly there are many. If swing voters decide to vote, their leaning is largely determined by the tone of public opinion in the final week. Originally indifferent either way, when the whole world says one side is bad, the reasonable inference is that they will vote for the other. As it stands, this race is indeed a tight see-saw, not one side far ahead. A month ago, when the audio scandal broke, Hillary was far ahead of Trump, with no suspense. In less than a month the race has changed dramatically and is hard to call; the signs point to a continued deadlock.

On the other hand, Trump's popularity is genuinely a reflection of public sentiment. However repugnant the man, the problems he attacks have long existed. In a sense, that an extreme socialist like Sanders could do so well this year and become an idol for the younger generation rests on similar discontent with the status quo and revolt against the establishment. Hillary is plainly the old guard inside the system, offering no hope of change. When people yearn for change, even a monster from outside the system can carry their hopes: at least he dares to do things differently, unencumbered by baggage.

Let him take office, then, and see what kind of world he creates.

The poet Wen Yiduo said it a century ago:
This is a ditch of hopeless dead water, where no breeze can raise a ripple. Better to throw in more scrap copper and iron, and freely dump your leftovers and slops.
......
This is a ditch of hopeless dead water, a place where beauty can never dwell. Better to let ugliness come cultivate it, and see what kind of world it creates.

 

[Related]

CNBC‎: AI system finds Trump will win the White House and is more popular than Obama in 2008

Trump sucks in social media big data in Spanish

Did Trump’s Gettysburg speech enable the support rate to soar as claimed?

[Social Media Mining: Did Trump's Gettysburg speech make his support rate soar?]

[Social Media Mining: Why vote for this one and not that one for president?]

Big data mining shows clear social rating decline of Trump last month

Clinton, 5 years ago. How time flies …

[Social Media Mining: Uncle Trump or Madam Hillary, who is looking presidential?]

[The humor contest between Trump and Hillary]

[Big-data opinion mining: the ebb and flow of Hillary's and Trump's images over the last month]

Ouyang Feng: Why conservatives should vote for Clinton

[Liwei's science notes: Automated polling]

[On public opinion mining]

《朝华午拾》 Table of Contents

 

Trump sucks in social media big data in Spanish

As promised, let us get down to the business of big data mining of public opinions and sentiments from Spanish social media on the US election campaign.

We know that in the automated mining of public opinions and sentiments for Trump and Clinton we did before, Spanish-Americans are severely under-represented, with only 8% Hispanic posters in comparison with their 16% in population according to 2010 census (widely believed to be more than 16% today), perhaps because of language and/or cultural barriers.  So we decide to use our multilingual mining tools to do a similar automated survey from Spanish Social Media to complement our earlier studies.

This is Trump as represented in Spanish social media for the last 30 days (09/29-10/29). The key is his social rating as reflected by his net sentiment of -33% (in comparison with his rating of -9% in English social media for the same period): way below the freezing point, it really sucks, as also illustrated by the concentration of negative Spanish expressions (in red font) in his word cloud visualization.

The net sentiment of -33% corresponds to 242,672 negative mentions vs. 121,584 positive mentions, as shown below. In other words, negative comments on Trump are about twice as numerous as positive comments in Spanish social media over the last 30 days.
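As a sanity check, the standard net sentiment formula, (positive - negative) / (positive + negative), reproduces the -33% figure from the mention counts above:

```python
def net_sentiment(positive, negative):
    """Net sentiment as a fraction in [-1, 1]."""
    return (positive - negative) / (positive + negative)

# Mention counts reported above for Trump in Spanish social media, last 30 days:
ns = net_sentiment(121_584, 242_672)
print(f"{ns:.0%}")  # -33%
```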

This is the buzz in the last 30 days for Trump: mentions and potential impressions (eye balls): millions of data points and indeed a very hot topic in the social media.

This is the BPI (Brand Passion Index) graph for directly comparing Trump and Clinton for their social ratings in the Spanish social media in the last 30 days:

As seen, there is simply no comparison: to refresh our memory, let us contrast it with the BPI comparison in the English social media:

Earlier in one of my election campaign mining posts on Chinese data, I said, if Chinese only were to vote, Trump would fail horribly, as shown by the big margin in the leading position of Clinton over Trump:

This is even more true based on social media big data from Spanish.

This is the comparison trends of passion intensity between Trump and Clinton:

Visualizing the same passion intensity data by weeks instead of by days shows even more clearly that people are very passionate about both candidates in the Spanish social media discussions; the intensity of sentiment expressed for Clinton is slightly higher than for Trump:

This is the trends graph for their respective net sentiment, showing their social images in Spanish-speaking communities:

We already know that there is simply no comparison: in this 30-day duration, even when Clinton dropped to her lowest point (close to zero) on Oct 9th, she was still way ahead of Trump, whose net sentiment at the time was -40%. In any other time segment we see an even bigger margin (as large as 40 to 80 points) between the two. Clinton has consistently been leading.

In terms of buzz, Trump generates more noise (mentions) than Clinton consistently, although the gap is not as large as that in English social media:

This is the geo graph: the social data come mostly from the US and Mexico, some from other Latin American countries and Spain:

Since only the Mexicans in the US may have voting power, we should exclude media from outside the US to get a clearer picture of how Spanish-speaking voters may affect this election. Before we do that filtering, we note that Trump sucks in the minds of the Mexican people, which is no surprise at all given his irresponsible comments about them.

Our social media tool is equipped with geo-filtering capabilities: you can add a geo-fence to a topic to retrieve all social media posts authored from within a fenced location. This allows you to analyze location-based content irrespective of post text. That is exactly what we need in order to do a study of Spanish-speaking communities in the US who are likely to be voters, excluding media from Mexico or other Spanish-speaking countries. This is also needed when we study the critical swing states to see the true picture of public sentiments and opinions in the states that will decide the destiny of the candidates and the future of the US (stay tuned: swing-states social media mining will come shortly, thanks to our fully automated mining system based on natural language deep parsing).
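A minimal sketch of geo-fencing with a bounding box; real tools use polygon fences, and the coordinates and field names here are illustrative assumptions, not the product's API:

```python
# Rough continental-US bounding box (illustrative, not a real polygon fence).
US_BOX = {"lat": (24.5, 49.4), "lon": (-125.0, -66.9)}

def in_fence(post, box):
    """True if the post's coordinates fall inside the box."""
    return (box["lat"][0] <= post["lat"] <= box["lat"][1]
            and box["lon"][0] <= post["lon"] <= box["lon"][1])

posts = [
    {"text": "...", "lat": 34.05, "lon": -118.24},  # Los Angeles
    {"text": "...", "lat": 19.43, "lon": -99.13},   # Mexico City
]
us_posts = [p for p in posts if in_fence(p, US_BOX)]  # keeps only the LA post
```

The same fence-then-analyze pattern is what a swing-state study would use, with one fence per state.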

Now that I have excluded Spanish data from outside America, it turns out that the social ratings are roughly the same as before: the reduction of the data does not change the general public opinion of the Spanish-speaking communities, inside the US or beyond. This is US-only Spanish social media:

This is summary of Trump for Spanish data within US:

It is clear that Trump's image truly sucks in the Spanish-speaking communities of the US, which is no surprise; it is so natural and evident that we are simply confirming and verifying it with big data and high tech.

These are sentiment drivers (i.e. pros and cons as well as emotion expressions) of Trump :

We might need Google Translate to interpret them but the color coding remains universal: red is for negative comments and green is positive. More red than green means a poor image or social rating.

In contrast, Clinton's word clouds show far more green than red, indicating that her support remains high in the Spanish-speaking communities of the US.

It looks like the emotional sentiments for Clinton are not as positive as her pros-and-cons sentiment drivers.

Sources of this study:

Domains of this study:

[Related]

Did Trump's Gettysburg speech enable the support rate to soar as claimed?

Big data mining shows clear social rating decline of Trump last month

Clinton, 5 years ago. How time flies …

Automated Survey

Dr Li’s NLP Blog in English

Did Trump's Gettysburg speech enable the support rate to soar as claimed?

The last few days have seen tons of reports on Trump's Gettysburg speech and its impact on his support rate, which some of his campaign media claim soared thanks to this powerful speech. We would love to verify this and uncover the true picture based on big data mining of social media.

First, here is one link on his speech:

DONALD J. TRUMP DELIVERS GROUNDBREAKING CONTRACT FOR THE AMERICAN VOTER IN GETTYSBURG. (The most widely circulated related post in Chinese social media seems to be this: Trump's heavyweight speech enables the soaring of the support rate and possible stock market crash).

Believed to be a historical speech in his last dash in the campaign, Trump basically said: I am willing to have a contract with the American people on reforming the politics and making America great again, with this plan outline of my administration in the time frame I promised when I am in office, I will make things happen, believe me.

Trump made the speech on the 22nd of this month; to mine true public opinion on the speech's impact, we can analyze the social media data around the 22nd. We believe automated polling based on big data and language understanding technology is much more revealing and dependable than the traditional manual polls, which phone something like 500 to 1,000 people. The latter laughably lack sufficient data to be trustworthy.

timeline-comparison-14

What does the above trend graph tell us?

1. Trump in this time interval was indeed on the rise. The "soaring" claim does not entirely come out of nowhere. But there is a big BUT.

2. BUT, a careful look at the public opinions represented by net sentiment (a measure reflecting the ratio of positive mentions over negative mentions in social media) shows that Trump has basically stayed below the freezing point (i.e. more negative than positive) in this time interval, with only a brief rise above the zero point near the 22nd speech, and soon went down underwater again.

3. The soaring claim cannot withstand scrutiny at all as soaring implies a sharp rise of support after the speech event in comparison with before, which is not the case.

4. The fact is, Uncle Trump's social media image dropped to its bottom on the 18th of this month (with a net sentiment of -20%). From the 18th to the 22nd, when he delivered the speech, his net sentiment rose steadily from -20% to 0; but from the 22nd to the 25th it no longer went up and instead fell back. So there is no ground at all for the claim that support soared as an effect of his speech.

5. Although not soaring, Uncle Trump's speech did not lead to a sharp drop either; in terms of the buzz generated, the speech can be said to have been fairly well delivered. After the speech, the net sentiment of public opinion dropped slightly, basically maintaining fundamentals close to zero.

6. The above big data investigation shows how misleading a media campaign can be when checked against objective evidence and real-life data. This is all propaganda, which cannot be trusted at face value: from the so-called "support rate soared" to "possible stock market crash". Basically campaign nonsense and noise, not to be taken seriously.

The following figure is a summary of the surveyed interval:

trump1

As seen, the average net sentiment for this interval is -9%, with a positive rating consisting of 2.7 million mentions and a negative rating of 3.2 million mentions.

How do we interpret -9% as an indicator of public opinion and sentiment? According to our numerous previous automated surveys of political figures, this is certainly not a good rating, but not particularly bad either; we have seen worse. Basically, -9% is below the average line among politicians as a reflection of public image in social media. Nevertheless, compared with Trump's own earlier ratings, there is a recorded 13-point jump in this interval, which is pretty good for him and his campaign. But the progress is clearly not the effect of his speech.

This is the social media statistics on the data sources of this investigation:

trump2

In terms of ratio, Twitter ranks No. 1; it is surely the most dynamic social medium on politics, with the largest number of posts generated every minute. Of a total of 34.5 million mentions of Trump, Twitter accounted for 23.9 million. In comparison, Facebook has 1.7 million mentions.

Well, let's zoom out to the last 30 days instead of only the days around the speech, to provide a bigger background for uncovering the overall trends of this political fight between Trump and Clinton in the 2016 US presidential campaign.

timeline-comparison-15

The 30 days range from 9/28-10/28, during which the two lines in the comparison trends chart show the contrast of Trump and Clinton in their respective daily ups and downs of net sentiment (reflecting their social rating trends).  The general impression is that the fight seems to be fairly tight.  Both are so scandal-ridden, both are tough and belligerent.  And both are fairly poor in social ratings.  The trends might look a bit clearer if we visualize the trends data by weeks instead of by day:

timeline-comparison-16

No matter how much I dislike Trump, and regardless of my dislike of Clinton, for whom I have decided to vote anyway to make sure the annoying Trump is out of the race, as a data scientist I have to rely on the data, which say that Hillary's recent situation is not too optimistic: Trump has actually at times gone a little ahead of Clinton (a troubling fact to recognize).

timeline-comparison-17

The graph above shows a comparison of mentions (buzz, so to speak). In terms of buzz, Trump is a natural topic king, generating the most noise and comments, good or bad. Clinton is no match in this regard.

timeline-comparison-18

The above is a comparison of public opinion passion intensity: like/love or dislike/hate? The passion intensity for Trump is really high, showing that he has crazy fans and/or deep haters among the people. Hillary Clinton has been controversial too, and it is not rare to come across people with very intense sentiments towards her. But still, Trump is something of a political anomaly, more likely to provoke fanaticism or controversy than his opponent Hillary.

In his recent Gettysburg speech, Trump highlighted the so-called danger of the election being manipulated. He clearly exaggerated the procedural risks beyond anything claimed by past candidates running under the same election protocol and mechanism. By doing so, he paved the way for refusing to recognize the election results later. He was even fooling the entire nation by saying publicly that he would totally accept the election results if he wins: this is not humor; it depicts a dangerous political figure with unchecked ambition, a very troubling sign and a fairly dirty political trick, to my mind.

Now the situation is this: if Clinton beats him by a large margin, old Uncle Trump will have no excuse or room for instigating incidents after the election. But if it is closer to a see-saw, which is not unlikely given the trends analysis above, then our country might be in trouble: Uncle Trump and his die-hard fans will almost certainly make a scene. Given the seriousness of this situation and the pressing risk of political turmoil, quite a few people, including some conservative minds, have begun to call for electing Hillary for the sake of preventing Trump from possible trouble-making.

I am of that mind-set too, given that I do not like Hillary either. If not for Trump, in an ordinary election like this where I dislike both major-party candidates, I would most likely vote for a third party, or abstain. But this election is different: it is too dangerous as it stands, like a time bomb hidden somewhere in Trump's house, totally unpredictable. In order to prevent him from spilling over, it is safer to vote for Clinton.

In comparison with my earlier automated sentiment analysis blogged about a week ago (Big data mining shows clear social rating decline of Trump last month), this updated, more recent BPI brand comparison chart seems more see-saw: Clinton's recent campaign seems to be stuck somewhere.

brand-passion-index-11

Over the last 30 days, Clinton's net sentiment rating is -17%, while Trump's is -19%; Clinton is only slightly ahead. Fortunately, Trump's speech did not really reverse the gap between the two, as seen fairly clearly from the following historical trends represented by three circles per brand (a darker circle represents more recent data): Clinton's general trajectory still holds: she started behind, improved, is now a bit stuck, but is still leading.

 

brand-passion-index-12

Yes, Clinton's most recent campaign activities are not making significant progress, despite more resources put to use, as shown by the bigger, darker circle in the graph. Among Clinton's three circles, the smallest and lightest stands for the first 10 days of the past 30, starting clearly behind Trump. The last two circles cover the most recent 20 days, seemingly in situ; the circle grows larger, indicating more campaign input and more buzz generated, but the benefits are not obvious. On Trump's side, the circles show a zigzag, with the overall trend actually declining over the past 30 days: the middle ten days saw a clear rise in his social rating, but the last ten days went back down.

Let us have a look at Trump's 30-day social media sentiment word clouds; the first is more about commenting on his pros and cons, and the second is more direct, emotional expression about him:

sentiment-drivers-38

sentiment-drivers-37
One friend glanced at the red-font expression "fuck" and asked: who are the subjects and objects of "fuck" here? In fact, the subject generally does not appear in social posts; by default it is the poster himself, part of the general public, and the object of "fuck" is of course Trump, for otherwise our deep-linguistics-based system would not count it as a negative mention of Trump in the graph. Let us show some random samples alongside the graph:

trumpfuck

trumpfuck2
My goodness, the "fuck" mentions account for 5% of the emotional data; poor old Uncle Trump was fucked nearly 400,000 times in social media within a one-month duration, showing how much this guy is hated by some of the very people he is supposed to represent and govern if he takes office. See how they express their strong dislike of Trump:

fucking moron
fucking idiot
asshole
shithead

you name it, to the point that even some Republicans curse him like crazy:

Trump is a fucking idiot. Thank you for ruining the Republican Party you shithead.

Looking at the following figure of popular media, it seems that the most widely circulated political posts in social media include quite a few political videos:

trumpmedia

The domains figure below shows that the Tumblr posts on politics contribute more than Facebook:

domains-6

In terms of the demographic background of social media posters, there is a fair balance between male and female: 52% male, 48% female (in contrast to Chinese social media, where only 25% of those posting political comments on the US presidential campaign are female). The figure below shows the ethnic background of the posters: 70% Caucasian, 13% African American, 8% Hispanic, and 6% Asian. It looks like Hispanic Americans and Asian Americans are under-represented in English social media relative to their population ratios, so this study may have missed some of their voice (but we have another similar study using Chinese social media, which shows a clear, big lead of Clinton over Trump; given time, we should do another automated survey of Spanish social media using our multilingual engine. Another suggestion from friends is to do a similar study on the swing states, since these are the key states that will decide the outcome; we can filter the data by post location to simulate that study). There might be language or cultural reasons for this under-representation.

trumpethinics

This last table involves a few fun facts from the investigation. In social media, people talk about the campaign most on Wednesday and Sunday evenings, with 9 o'clock as the peak; for example, on the topic of Trump, nine o'clock on Sunday evening generated 1,357,766 messages within one hour. No wonder there is no shortage of big data on politics from social media. It is all about big data. In contrast, with a traditional manual poll, no matter how the sampling is done, the limited number of data points is a serious problem: with typically 500 to 1,000 phone calls, how can we trust that the poll represents the opinions of 200 million voters? That is laughably sparse data. Of course, in the pre-big-data age there was simply no alternative for collecting public opinion in a timely manner on a limited budget. This is the beauty of the Automated Survey, which is bound to outperform the manual survey and become the mainstream of polling.
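To put numbers behind "laughably sparse": under the standard worst-case (p = 0.5) formula, the margin of error of a simple random sample shrinks with the square root of the sample size. A quick sketch with illustrative sample sizes:

```python
import math

def margin_of_error(n, z=1.96):
    """Worst-case (p = 0.5) margin of error at ~95% confidence."""
    return z * math.sqrt(0.25 / n)

moe_phone = margin_of_error(1_000)       # traditional poll: ~±3.1%
moe_social = margin_of_error(1_000_000)  # million-post sample: ~±0.1%
```

To be fair, social media posts are not a random sample of voters, so the real advantage is less clean than the formula suggests, but the raw scale gap is still striking.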

trumpdayhour

Authors with most followers are:

trumpmedia2

Most mentioned authors are listed below:

trumpauthors

Tell me, when in history did we ever have this much data and information, with such powerful capabilities for fully automated mining of public opinions and sentiments at scale?

trumppopularposts

 

[Related]

Big data mining shows clear social rating decline of Trump last month

Clinton, 5 years ago. How time flies …

Automated Survey

Dr Li’s NLP Blog in English

 

 

[Social Media Mining: Did Trump's Gettysburg speech make his support rate soar?]

My days and nights are flipped anyway, so let us take this seriously and see, with big data and big knowledge, what the claimed soaring of opinion after Trump's Gettysburg speech is really about. First a few links:

DONALD J. TRUMP DELIVERS GROUNDBREAKING CONTRACT FOR THE AMERICAN VOTER IN GETTYSBURG

The report covers Uncle Trump's historic speech on the 22nd of this month, intended to rally spirits for the final dash of the campaign. The gist:
I have a contract with the American people; watch me, believe me.

In Chinese-language media, this piece seems to have circulated most widely: [Trump's heavyweight speech makes his support rate soar; will global stock markets crash?].

Since Trump's speech was on the 22nd, to examine the claimed soaring we can center on the 22nd, take a few days on either side, and analyze the social media big data to see what really happened. At the very least, automated polling on big data (millions of data points) should be more reliable than a traditional poll that phones five hundred or a thousand people.

timeline-comparison-14
How should we read this trend graph?

1. Trump in this interval was indeed on the rise overall. The soaring claim is not entirely fabricated from nothing (more precisely, it reads too much into too little; see below).

2. But a careful look at the net sentiment curve shows that throughout this period Trump basically never escaped the state of negative opinion outweighing positive: apart from briefly crossing the freezing point on the 22nd itself, the curve stayed below zero the whole time.

3. The soaring claim cannot withstand scrutiny: soaring requires an obvious leap in opinion after the event compared with before, which is not what happened.

4. The fact is, the trough of Uncle Trump's recent opinion curve was the 18th of this month (around -20). From the 18th to the 22nd, before he delivered the speech, his rating had already clearly improved (from -20 to 0); and from the 22nd to the 25th it did not rise but dipped slightly. Where is the soaring?

5. Although there was no soaring, Uncle Trump's performance this time still passes. At least after the speech the opinion curve did not plunge, basically holding fundamentals near zero.

6. This shows just how much media hype stretches the facts. Next time you see obviously promotional (propaganda) posts like this, be on guard: such posts routinely exaggerate, when they do not outright invert black and white, from the so-called "soaring support" to the predicted "stock market crash", all meant to signal the heavyweight nature of Trump's speech. Basically groundless talk, not to be taken seriously.

The figure below is a summary of the data for this survey interval:

trump1

The average net sentiment index for this interval is -9%: 2.7 million positive mentions, 3.2 million negative mentions.

What does -9% mean? Judging from our many past surveys of political figures, this is not a good rating, but not especially terrible either; it is below the average line. Compared with Trump's own overall rating, however, this interval is a good showing, a 13-point improvement, but that improvement was not brought about by the so-called soaring speech.

These are the statistics on the social media data sources:

trump2

By proportion, Twitter is always the most dynamic and the largest in volume: of 34.5 million total mentions, Twitter accounts for 23.9 million. Quite a few social media analysis apps simply drop the other sources and work on Twitter alone as the representative of social media, which is mostly acceptable. Still, my feeling is that Twitter-only, while guaranteeing big-data volume, may carry a larger bias, because those who like to follow, flame, or fan political figures on Twitter are only one segment of society, often the more fanatical segment. Twitter as a public platform has always excelled at interaction between idols and followers (fans or haters). Other social media may be more down-to-earth; posts on Facebook, for example, are largely addressed to one's circle of friends. Facebook also has 1.7 million mentions.

OK, let us widen the interval and look at the last 30 days as background for the trends around this speech.

timeline-comparison-15
This is the Trump vs. Clinton opinion trend comparison for 9/28-10/28, by day. Before a careful reading, the overall impression is one of entanglement. These two are quite the pair: a tangle that cannot be cut or combed straight, rivals destined to collide, heh. Both are scandal-ridden, both tough and stubborn. The one-month curves by week may be clearer:

timeline-comparison-16

No matter how much I loathe Trump, and no matter that loathing Trump has made me decide to vote for a Clinton I do not like, as a data scientist I must say: Hillary's recent situation is not very optimistic; Trump is actually starting to show a slight lead over Clinton, dammit.

timeline-comparison-17

The graph above compares buzz (mentions). No contest here: Trump is a born topic king, and Clinton cannot catch up no matter what.

timeline-comparison-18

This compares passion intensity: those who love or loathe Trump are still more fanatical, even though Hillary Clinton already provokes stronger emotions than most political figures. Trump, though, is a political anomaly, still more apt to trigger fanaticism or controversy.

In his speech Trump especially stressed the danger of the election being rigged. He is clearly exaggerating that danger to lay the groundwork for refusing to recognize the result later. Quite disgusting. The situation now is: if Clinton leads by a wide margin, Uncle Trump can play the rogue all he wants to no avail. If it is a close see-saw, there will be trouble: old Trump and his fans will almost certainly make a scene. And the race does look somewhat deadlocked, which is why many people, including conservatives, have begun urging: because of Trump, please vote Clinton. I had meant to vote third party, or abstain, but this election is different; the danger is too great. Old Trump is a time bomb, and an unpredictable one. To keep him from running amok, better to vote for Clinton. At least let him see that a circus act does not belong on the big stage, and that he does not get to do as he pleases. A monkey in a crown does not become Lincoln.

Compared with the automated poll I did a week ago (Big data mining shows clear social rating decline of Trump last month), the brand comparison chart below looks even more see-saw; Clinton's recent showing is not great.

brand-passion-index-11

Over the last 30 days, Clinton is at -17% and Trump at -19%, a slight lead for Clinton. Fortunately, Trump's speech did not really reverse the gap between the two; from the historical-trend brand comparison below, Clinton's trajectory from trailing to leading still holds:

brand-passion-index-12
But lately Clinton's campaign has been treading water, with no obvious progress. Comparing Clinton's three circles, the lightest is the first 10 days of the past 30, clearly behind Trump; the latter two circles are the most recent 20 days, basically in place, only larger, meaning more campaign input and intensity but no obvious payoff. From Trump's three circles, the old man's actual overall trend over the thirty days is downward: the middle ten days improved, but the most recent ten slid back, even as buzz grew. (Damn, this analysis cannot be done calmly; the deeper it goes the more hair-raising it gets, but I am a data scientist after all. A friend said, "dig up something hair-raising"; truly wishing for chaos.) Now look at Trump's 30-day Word Cloud for pros and cons and Word Cloud for emotions:

sentiment-drivers-38

sentiment-drivers-37
A friend spotted the big red "fuck" sentiment at a glance and asked: who are the subject and object of "fuck"?

The subject usually does not appear; by default it is the ordinary netizen posting, and the object of "fuck" is of course Trump, or it would not show up in his negative emotion cloud:

trumpfuck

trumpfuck2
Heavens, fuck mentions account for 5% of the emotion data; old Trump was fucked nearly 400,000 times by ordinary social media users in one month. Imagine how many irreconcilable citizens he would have if he took office. See above how the cursing goes:

fucking moron
fucking idiot
asshole
shithead

you name it; even apparent Republicans fuck him:
Trump is a fucking idiot. Thank you for ruining the Republican Party you shithead.

 

Looking at popular media, the most widely circulated items seem mostly to be videos:

trumpmedia

Has Tumblr overtaken Facebook as the No. 2 social medium?

domains-6

I have never used Tumblr, and the name is a mouthful; how did it get so popular?

Among those venting in Western media, the gender balance is fairly even: 52% male, 48% female. Compare Chinese social media, where women clearly talk politics less, at only 25%. The ethnic distribution of this survey:

trumpethinics

Caucasians are still the overwhelming majority. Ethnicity information is present in nearly half of the social posts, so this ethnic-distribution intelligence should be reliable. African Americans are second at 13%, Asians only 6%. Hispanics at 8%, out of proportion to their population share(?): under-represented here due to language or cultural barriers??

This one is interesting: people who like to vent on social media concentrate on Wednesday and Sunday evenings, peaking at 9 pm. For instance, posts on the Trump topic reached 1,357,766 at nine on Sunday evening; 1.35 million posts in a single hour. Big data indeed.

trumpdayhour

And this is only sampled data: the Twitter sample is roughly one tenth of the total. With the data hose (which costs extra) catching everything, the volume grows by another order of magnitude. For big-data intelligence mining, though, one more order of magnitude no longer matters much; it would not substantially change the survey results. One note: that Sunday statistic should be the sum over the Sundays in the month surveyed; there are four Sundays in a month, so divide by 4, then multiply by 10, to get the true volume of Trump data in the Sunday-9pm time slot. In any case, genuine big data. By contrast, a traditional poll, however sampled, feels like child's play, a bit of a farce:
500 phone calls claiming to represent the opinions of two hundred million people; if that is not child's play, what is? Still, in the pre-big-data age that was the only option available. Automated polling is the way of the future.
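The back-of-envelope correction just described can be written out; the 10% sampling rate is the text's own rough estimate, not an exact figure:

```python
reported = 1_357_766     # dashboard count: four Sundays of ~10%-sampled data
sundays_per_month = 4
sampling_rate = 0.10     # rough Twitter sample-stream rate assumed above

# Per-Sunday, full-stream estimate for the 9 pm hour:
true_hourly = reported / sundays_per_month / sampling_rate  # ~3.4 million
```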

The figure below shows the most influential authors with the most followers:

trumpmedia2

Most mentioned authors below:

trumpauthors

When has any era had information this rich and data mining power this strong?

RW:
@wei You could really build an election prediction engine with your current methodology; fine-tune it a bit and it would attract a lot of eyeballs. If it works well, you could charge next time. What marketing is more effective than instant fame?

Me:
If I had WeChat data, I would be famous without firing a shot. Nothing needs to change: the current engine, the current app; given WeChat, no intelligence expert could compete. Why am I less confident publishing Chinese opinion mining than English? Not because my Chinese is weak, but because the data sources are too crappy. It is always Sina Weibo, the Tianya forum, Chinese Twitter or Facebook. The overwhelming majority of mainland-background Chinese worldwide use WeChat, and that data is out of reach, so their views go unreflected.

Li:
@wei A team at my company is doing something similar.

Me:
Can you get at WeChat data?

Li:
Only Tencent has WeChat personal data.

Let us see what the most widely circulated social media posts are:

trumppopularposts

By total engagement, it is without doubt Trump's own Twitter account, plus Fox, probably the only remaining Republican voice in the mainstream media. No wonder: old Trump spent the campaign pointing fingers at the mainstream media, even mocking anchors for bias. There has probably never been a candidate so at war with the mainstream media, nor one so loathed by them.

At this point in the demo, a friend forwarded a fresh post claiming an AI prediction that Trump will win the US election: Trump will win the election and is more popular than Obama in 2008, AI system finds. Quote:

"But the entrepreneur admitted that there were limitations to the data in that sentiment around social media posts is difficult for the system to analyze. Just because somebody engages with a Trump tweet, it doesn't mean that they support him. Also there are currently more people on social media than there were in the three previous presidential elections."

Haha, rivals in the same trade. Can his AI match the I backed by my natural language deep parsing? From the article, he leans on engagement, which in essence is just topicality and buzz. As said long ago, Trump is the topic king, absolutely leading in buzz. (Just like Fan Bingbing: the topic queen still lost in sentiment to the opinion-favored Gao Yuanyuan, no?) Not to belittle a fellow coder, but this is largely eyeball-grabbing: everyone says Trump will lose, so I say he must win. Even if proven wrong two weeks later, the name has already spread. The Trump team will also spare no effort to promote and retweet it.

Xi:
That Indian fellow is also talking through his hat.
Knowing an IP address and knowing the content of an SSL-encrypted search are two different things!
Either the journalist does not understand, or the guy is just bluffing.

Hong:
An Indian AI company predicts the US election with a better than 50% chance of being right; Chinese AI companies should not miss the opportunity either.

Mao:
Why does Wei think Trump will surely win? Didn't they say Hillary's odds are 95%?

Nanshan/Deng Baojun: That is not what Wei said.

Me:
That was someone else butting in. If Trump wins, I will jump in the river...

Mao:
Oh, Wei was just relaying it.

Me:
The river jump was a joke; I can always move back to Canada.

Li:
That Korean scoop was well timed. Hillary may also drop a bombshell at the critical moment.

Me:
The question is who drops dirt on whom. Both have reached the final hour, and whatever dirt could be found seems mostly used up. Wait longer and it will not land in time: early voting has begun in many places. If anyone holds a trump card, two or three more days is the limit, to give the media and the public time to digest and chew on it.

Mao:
@wei But the Indian system was not built specifically for this election, and it is said to have called the last three correctly?

Me:
Mine was not built specifically for the election either. And last time Obama decided to use us; look, he won, and we lent a hand.

Mao:
Do your two shops use different recipes?

Me:
The Obama team embraced new technology, using opinion mining to monitor and adjust campaign strategy, which is a notch above mere prediction. Prediction is an outsider betting on odds; mine was engaging in the process, technology lending a hand, heh. We were not allowed to say so at the time.

Li:
Obama might go work in Silicon Valley.

Mao:
Are there factors beyond public opinion?

Li:
So that photo of you with Obama was not a wax figure after all.

Me:
When the fake is taken as real, the real becomes fake.

002_510_image

 

[Related]

[Social Media Mining: Why vote for this one and not that one for president?]

Big data mining shows clear social rating decline of Trump last month

Clinton, 5 years ago. How time flies …

[Social Media Mining: Uncle Trump or Madam Hillary, who is looking presidential?]

[The humor contest between Trump and Hillary]

[Big-data opinion mining: the ebb and flow of Hillary's and Trump's images over the last month]

Ouyang Feng: Why conservatives should vote for Clinton

[Liwei's science notes: Automated polling]

[On public opinion mining]

《朝华午拾》 Table of Contents

 

 

 

 

[Social Media Mining: Why vote for this one and not that one for president?]

Continuing our mining of Chinese social media for Chinese-speaking opinions on the US election.

Why and why not Clinton/Trump?

Why Madam Hillary? Why Uncle Trump? Why not Clinton? Why not Trump? This is the first question of the election, and the focus of our opinion mining. Why???

First, why Clinton and why not Clinton? Look at the pros-and-cons comparison for Madam Hillary in the opinion data.

sentiment-drivers-33

Why Clinton? Setting aside campaign-related praise such as excellent debate performance ("leading", "winning", "having the upper hand", and so on), the main reasons are:

1. seasoned and tough; 2. optimistic; 3. clear; 4. revitalized, witty in conversation; 5. dreams of a common market

拿着放大镜,除了政治套话和谀辞外也没看到什么真正的亮点。舆情领先,只能说对手太差了吧。八年前与奥巴马竞争被甩出一条街去,那是遇到了真正的强手。

OK,why not Clinton?

1. 性侵 性骚扰 威胁(她丈夫做的好事,她来背黑锅,呵呵。照常理她是受害者,可以同情的,不料给同样管不住下半身的川普一抹黑,她倒成了性侵的帮凶,说是威胁被性侵的女性。最滑稽的是,川普自己的丑闻曝光,他却一本正经带了一帮前总统克林顿的绯闻女士开记者会,来抹黑自己的对手克林顿夫人。滑稽逆天了。)

2. 邮件门 曝光 泄密

3. 竞选团队的不轨行为 操纵大选 作弊

4. 克林顿基金会的问题

5. 华尔街收费

6. 健康问题

7. 撒谎、可耻

8. 缺乏判断力

这些都不是新鲜事儿,大选以来已经炒了很久了,但比起她的长处(经验老练等少数几条),喜妈被抓住的辫子还真不少。再看网民的情绪性吐槽, 说好话都是相似的,坏话却各有不同:轻的是,“乏善可陈”、“不喜欢”、“不信任”; 重的是:“妖婆”,“婊子”、“灾难”、“无耻”、“邪恶”。

sentiment-drivers-34
作为对比,来看川大叔,why or why not Trump?

sentiment-drivers-35

pros:1. 减税;2. 承诺 崛起 (America great again);3. 真实;4. 擅长 business
cons:
1. 曝光的视频丑闻 性骚扰
2. 偷税漏税
3. 吹嘘
4. 咄咄逼人 喜怒无常
5. 粗鄙、威胁
6. 撒谎

情绪性吐槽,轻的是 “不靠谱”、“出言不逊”,重的是 “恶心”、“愚蠢”、“卑劣”、“众叛亲离”。

sentiment-drivers-36
上篇中文社煤自动民调博文发了以后有朋友问,为什么不见大名鼎鼎的脸书。(微信不见可以理解,人家数据不对外开放,对隐私性特别敏感,比脸书严多了。不过,地球人都知道,反映我大唐舆情最及时精准的大数据宝库,非微信莫属)。查对了一下,上次做的中文舆情调查,不知何故 Facebook 不在 top 10,只占调查数据的 0.1%:

sources-9

记得以前的英语社煤调查,通常的比例是 70% twitter,20% Facebook, 其他所有论坛和社交媒体只占 10%。最近加了 instagram、Tumblr 等,格局似有变。但是中文在海外,除了推特,Facebook 本来应该有比重的,特别是我台湾同胞,用 Facebook 跟东土用微信一样普遍。

再看看这次调查的网民背景分类。

1.  职业是科技为主(大概不少是咱码农),其次才是新闻界和教育界。这些人喜欢到网上嚷嚷。

professions

这是他们的兴趣(interests),有意思的关联似乎是,喜欢谈政治的与喜欢谈宗教和美食的有相当大交集。

interests

这是年龄分组,分布比较均匀,但还是中青年为主。

age

性别不用说,男多女少。男人谈政治与女人谈shopping一样热心。

gender

最后看看地理分布,社煤的地理来源:
geo-regions

 

 

【相关】

【社媒挖掘:川大叔喜大妈谁长出了总统样?】

Big data mining shows clear social rating decline of Trump last month

【川普和希拉里的幽默竞赛】

【大数据舆情挖掘:希拉里川普最近一个月的形象消长】

论保守派该投票克林顿

【立委科普:自动民调】

【关于舆情挖掘】

《朝华午拾》总目录

Clinton, 5 years ago. How time flies ...

 311736_10150433966356900_893547465_n
克林顿白,立委黑,
“立委,你的头发太有个性了。”
无独有偶,黑白的对比引来大小的反差。克林顿的头发居然比立委的还白,岂有此理。

“好想知道 Clinton 在你背後說的是好話,還是壞話?”

【社媒挖掘:川大叔喜大妈谁长出了总统样?】

眼看决战时刻快到了,调查一下华人怎么看美国大选,最近一个月的舆情趋势。中文社会媒体对于美国总统候选人的自动调查。

aaa

先看喜大妈,是过去三十天的调查(时间区间:9/26-10/25)
summary-metrics-new-3
mentions 是热议度,net sentiment 是褒贬指数,反映的是网民心目中的形象。
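
顺便用一个极简的 Python 草图示意这两个指标的一种常见算法(公式与字段名是笔者的假设,未必是我们平台的确切实现):

```python
# 极简示意:热议度(mentions)与褒贬指数(net sentiment)的一种常见算法。
# 假设 net sentiment = (正面数 - 负面数) / (正面数 + 负面数),仅作示意。

def buzz_and_net_sentiment(soundbites):
    """soundbites: [(文本, 标签)] 列表,标签取 'pos' / 'neg' / 'neu'。"""
    mentions = len(soundbites)                      # 热议度:总提及数
    pos = sum(1 for _, label in soundbites if label == 'pos')
    neg = sum(1 for _, label in soundbites if label == 'neg')
    net = (pos - neg) / (pos + neg) if pos + neg else 0.0
    return mentions, net

# 玩具数据:6 条提及,2 正 3 负 1 中性
data = [("Kudos to X", "pos"), ("X won", "pos"),
        ("X is a disaster", "neg"), ("I hate X", "neg"),
        ("X sucks", "neg"), ("X spoke today", "neu")]
mentions, net = buzz_and_net_sentiment(data)
print(mentions, round(net, 2))   # 6 -0.2
```

中性提及计入热议度、但不计入褒贬指数,这也是一个常见的设计选择。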

summary-metrics-6
很自然,二者并不总是吻合:譬如,在 10 月 10 日到 11 日的时候,希拉里被热议,而她的褒贬指数则跌入谷底。那天有喜大妈的什么丑闻吗?咱们把时间按周(by weeks)而不是按日来看 trends,粗线条看趋势也许更明显一些:

summary-metrics-7
Anyway,过去30天的总社煤形象分(net sentiment)是 11%,比起英语世界的冰点之下(-18%)好太多了,似乎华语世界远不如英语世界对老政客喜大妈的吐槽刻薄。

作为对比,我们看看川普(特朗普)在同一个时期的社会形象的消长趋势:川普过去 30 天的总社煤形象分(net sentiment)是 -12%,与希拉里的 +11% 形成鲜明对比。

summary-metrics-8

看上面的趋势图(by weeks),川普的热议度一直居高不下,话题之王名副其实,但他的社会评价却一直在冰点之下,十月初更是跌入万丈深渊。同时期的希拉里,热议度与社会评价却时有交叉。趋势 by days:

summary-metrics-9

这样看来,虽然有所谓华人挺川的民间鼓噪,总体来看,川大叔在华人的网上口水战中,与喜大妈完全不是一个量级的对手。川普很臭,真的很臭。在英语社煤中,川普也很臭(-20%),但希拉里也不香,民间厌恶她诅咒她的说法随处可见,得分 -18%,略好于川普。譬如电邮门事件,很多老美对此深恶痛绝,不少华人(包括在下)心里难免觉得是小题大作。为什么华人世界对希拉里没有那么反感,居然给她 +11% 的高评价呢?朋友说,希拉里更符合华人主流价值观吧。

这是我们的品牌对比图,三维直观地对比两位候选人在社煤的形象位置:

brand-passion-index-10

希拉里领先太多,虽然热议度略逊。

总有人质疑社煤挖掘的情报价值,说也许 NLU 不过关,挖掘有误呢。更多的质疑是,也许某党的人士更愿意搅浑水呢(譬如利用水军或机器人 bots)。凡此种种,都给社会媒体舆情挖掘在多大程度上反映民意提出了疑问和挑战。其实,传统民调也一样:不同的机构有不同的结果,加上手工民调的取样不可能大,error margin 也大,所以大家也都是一肚子怀疑。可怀疑归怀疑,民调还是在不断进行,这是大选年的信息“刚需”吧。

所有的自动的或人工的民调,都可能有偏差,都只能做民意的参考。但是我要强调的是:

1. 现在的深度 NLU 支持的舆情挖掘,已经今非昔比,加上大数据信息冗余度的支撑,精准度在宏观上是可以保障的;

2. 全自动的社煤民调,其大数据的特性,是人工民调无法比的(时效以及costs也无法比,见【立委科普:自动民调】);

3. 虽然社煤上的口水、噪音以及不同党派或群体在其上的反映都可能有很大差异,但是社煤民调的消长趋势的情报以及不同候选人(或品牌)的对比情报,是相对可靠的。怎么讲?因为自动系统具有与生俱来的一视同仁性。

时间维度上的舆情消长,具有相对的比较价值,它基本不受噪音或其他因素的影响,也不大受系统数据质量的影响(当然,太臭的舆情系统还是糊不上墙:靠一袋子词、准确率跟抛硬币差不了太多的“主流”舆情分类,在短消息压倒多数的社会媒体上,还是不要提了吧,参见《一切声称用机器学习做社会媒体舆情挖掘的系统,都值得怀疑》)。

我们目前的系统,是 deep parsing 支持,本性是 precision 优于 recall(precision 不降低,recall 也可以慢慢爬上来,譬如我们的英语舆情系统就有相当好的recall,recall在符号逻辑路线里面,本质上就是开发时间的函数)。Given big data 这样的场景,recall 的某种缺失,其实并不影响舆情的相对意义,因为决定 recall 的是规则量,缺少的是一些长尾 pattern rules,而语言学的 rules 不会因为时间或候选人的不同,而有所不同。同理,因为系统的编制是独立于千变万化的候选人、品牌或话题,因此数据质量对于候选人之间的比较,是靠谱的。这样看,舆情趋势和候选人对比的情报挖掘,的确真实地反映了民意的消长和相对评价。下面是这次自动民调的 Top 10 数据来源(可惜没有“她”,我是说 wechat),还是最动态反映舆情的推特中文帖子占多数(其中 66% 简体,30% 繁体,4% 粤语)。

domains-5

看一下popular的帖子,居然小方的也在其列。倒也不怪,方在中文社煤还是有影响力的。

chuanpupopularposts

小方总结得不错啊,难得同意他:满嘴跑火车的川大叔是“谎言大王”。其实川普与其说是谎话连篇,不如说是他根本不 care、不屑去核对事实,就跟北京出租司机信口开河成了习惯一样。话说到这里,转一篇我的老友刚写的博文《论保守派该投票克林顿》,quote:

川普说话不顾事实是众所周知的。只要他一开口,就忙坏了各种事实核查 fact check ......
更重要的是,川普不仅犯了大大小小众多的事实错误,而且对事实抱着强烈的轻蔑和鄙视。

总结一下这次民调的结果可以说,如果是华人投票,川普不仅是 lose 而是要死得很惨,很难看。(当然,不管华人与否,川普都没有啥胜算。)

timeline-comparison-12

这是 by days 的趋势对比,这种持续的舆情领先在大选前很难改变吧:

timeline-comparison-13

【更多美国大选舆情的自动调查还在进行整理中,stay tuned】

 

【相关】

【社煤挖掘:为什么要选ta而不是ta做总统?】

Big data mining shows clear social rating decline of Trump last month

【川普和希拉里的幽默竞赛】

【大数据舆情挖掘:希拉里川普最近一个月的形象消长】

论保守派该投票克林顿

【立委科普:自动民调】

【立委科普:舆情挖掘的背后】

【社媒挖掘:《品牌舆情图》的设计问题】

一切声称用机器学习做社会媒体舆情挖掘的系统,都值得怀疑

【关于舆情挖掘】

《朝华午拾》总目录


Big data mining shows clear social rating decline of Trump last month

Big data mining from last month's social media shows a clear decline of Trump in comparison with Clinton

aaa

Our automatic big data mining of public opinions and sentiments from social media speaks loud and clear: Trump's social image sucks.

Look at the last 30 days of social media on Hillary's and Trump's social images and standings in our Brand Passion Index (BPI) comparison chart below:

brand-passion-index-8

Three points to note:
1. Trump generates more than twice as much buzz as Hillary in terms of social media coverage (the size of the circles indicates the degree of mentions);
2. Public sentiment toward Trump is more intense than toward Clinton: the Y-axis shows passion intensity;
3. The social ratings and images of the two are both quite poor, but Trump is criticized more on social media: the X-axis shows the Net Sentiment rating index. Both are below the freezing point (meaning more negative comments than positive).

If we want to automatically investigate the trends of the past month and the ups and downs of their social images, we can segment the data into two or three pieces. The figure below contrasts the first 15 days of social media data with the second 15 days of the 30-day period (up to 10/21/2016):

brand-passion-index-7

See, in the past month, as the presidential debates and scandals drew attention, Trump's media image deteriorated significantly, represented by his opinion circle shifting from the right of the X-axis to the left (toward dislike or hate sentiments; the lighter circle represents older data than the darker circle). His social rating was clearly better than Hillary's to start with and ended up worse. At the same time, Hillary's social media image improved, her circle moving a bit from left to right. Both candidates have stayed below the freezing point, as clearly shown in the figure, but just a month ago Clinton was rated even lower than Trump in social media opinion: it is not that people liked Trump that much, but that the general public showed more dislike for Hillary, for whatever reasons.

As seen, our BPI brand comparison chart attempts to visualize four-dimensional information:
1. net sentiment for social ratings on the X-axis;
2. the passion intensity of public sentiments on the Y-axis;
3. buzz circle size, representing mentions of soundbites;
4. The two circles of the same brands show the coarse-grained time dimension for general trends.

It is not very easy to represent 4 dimensions of analytics in a two-dimensional graph.  Hope the above attempt in our patented visualization efforts is insightful and not confusing.

If we are not happy with the divide-into-two strategy for showing trends in one month of data, how about cutting it into three pieces? Here is the figure with three circles in the time dimension.

brand-passion-index-6

We should have used different colors for the two political brands to make the visualization a bit clearer. Nevertheless, we see Clinton's three circles of social media sentiment shifting from the lower left corner to the upper right in a zigzag path: getting better, then worse, and ending up somewhere in between at this point (more exactly, as of 10/21/2016). For the same three segments of data, Trump's brand image started not bad, then went slightly better, and finally fell into the abyss.
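
The divide-and-aggregate step behind these circles can be sketched in a few lines of Python (a toy sketch under assumed field names, not our production pipeline): each circle is one time segment, positioned by its aggregated net sentiment, with buzz as its size.

```python
# Toy sketch: split a window of daily counts into k time segments and
# aggregate per-segment (net sentiment, intensity, buzz) for each circle.
# Field names ('pos', 'neg', 'intensity') are illustrative assumptions.

def segment_circles(daily, k):
    """daily: list of per-day dicts with 'pos', 'neg', 'intensity' counts."""
    n = len(daily)
    size = n // k
    circles = []
    for i in range(k):
        # last segment absorbs any remainder days
        seg = daily[i * size:(i + 1) * size] if i < k - 1 else daily[i * size:]
        pos = sum(d['pos'] for d in seg)
        neg = sum(d['neg'] for d in seg)
        buzz = pos + neg
        net = (pos - neg) / buzz if buzz else 0.0
        intensity = sum(d['intensity'] for d in seg) / len(seg)
        circles.append({'net': net, 'intensity': intensity, 'buzz': buzz})
    return circles

# toy data: 30 days with sentiment steadily worsening
days = [{'pos': 30 - d, 'neg': d, 'intensity': 50 + d} for d in range(30)]
for circle in segment_circles(days, 3):
    print(circle)
```

With k = 2 we get the two-circle chart above; with k = 3, the three-circle one.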

The above uses our own brand comparison chart (BPI) to decode the changes and trends of the two US presidential candidates' social images. This analysis, entirely automated and based on deep Natural Language Parsing technology, is supported by data points of a magnitude many times larger than those of traditional manual polls, which are by nature severely restricted in data size and time response.

What are the sources of social media data for the above automated polling? They are based on random big-data sampling of social media, headed by the most dynamic source, Twitter, as shown below.

sources-5

sources-4

sources-3

This is a summary of the public opinions and sentiments:

%e5%b7%9d%e6%99%ae%e5%b8%8c%e6%8b%89%e9%87%8c

As seen, it is indeed BIG data: a month of randomly sampled social media data involves nearly 200 million mentions of the candidates, with a total of up to 3,600+ billion impressions (potential eyeballs). Trump accounted for 70 percent of the buzz while Clinton only 30 percent.

For the overall social rating during the period of 09/21/2016 through 10/21/2016, Trump's net sentiment is minus 20% and Clinton's minus 18%. These measures are much lower than those of most other VIP analyses we have done before using the same calculations. Fairly nasty images, really. And the big data trends show that Trump sucks most.

The following is some social media soundbites for Trump:

Bill Clinton disgraced the office with the very behavior you find appalling in...
In closing, yes, maybe Trump does suffer from a severe case of CWS.
Instead, in this alternate NY Times universe, Trump’s campaign was falling ...
Russian media often praise Trump for his business acumen.
This letter is the reason why Trump is so popular
Trump won
I'm proud of Trump for taking a stand for what's right.
Kudos to Trump for speaking THE TRUTH!
Trump won
I’m glad I’m too tired to write Trump/Putin fuckfic.
#trump won
Trump is the reason Trump will lose this election.
Trump is blamed for inciting violence.
Breaking that system was the reason people wanted Trump.
I hate Donald Trump for ruining my party.
>>32201754 Trump is literally blamed by Clinton supporters for being too friendly with Russia.
Another heated moment came when Trump delivered an aside in reponse to ...
@dka_gannongal I think Donald Trump is a hoax created by the Chinese....
Skeptical_Inquirer The drawing makes Trump look too normal.
I'm proud of Donald Trump for answering that honestly!
Donald grossing me out with his mouth features @smerconish ...
Controlling his sniffles seems to have left Trump extraordinarily exhausted
Trump all the way people trump trump trump
Trump wins
Think that posting crap on BB is making Trump look ridiculous.
I was proud of Trump for making America great again tonight.
MIL is FURIOUS at Trump for betraying her!
@realdonaldTrump Trump Cartel Trump Cartel America is already great, thanks to President Obama.
Kudos to Mr Trump for providing the jobs!!
The main reason to vote for Trump is JOBS!
Yes donal trump has angered many of us with his WORDS.
Trump pissed off a lot of Canadians with his wall comments.
Losing this election will make Trump the biggest loser the world has ever seen.
Billy Bush's career is merely collateral damage caused by Trump's wrenching ..
So blame Donald for opening that door.
The most important reason I am voting for Trump is Clinton is a crook.
Trump has been criticized for being overly complimentary of Putin.
Kudos to Trump for reaching out to Latinos with some Spanish.
Those statements make Trump's latest moment even creepier.
I'm mad at FBN for parroting the anti-Trump talking points.
Kudos to Trump for ignoring Barack today @realDonaldTrump
Trump has been criticized for being overly complimentary of Putin.
OT How Donald Trump's rhetoric has turned his precious brand toxic via ...
It's these kinds of remarks that make Trump supporters look like incredible ...
Trump is blamed for inciting ethnic tensions.
Trump is the only reason the GOP is competitive in this race.
Its why Republicans are furious at Trump for saying the voting process is rigged.
Billy Bush’s career is merely collateral damage caused by Trump’s wrenching ..
Donald Trump is the dumbest, worst presidential candidate your country ...
I am so disappointed in Colby Keller for supporting Trump.
Billy Bush’s career is merely collateral damage caused by Trump’s wrenching..
In swing states, Trump continues to struggle.
Trump wins
Co-host Jedediah Bila agreed, saying that the move makes Trump look desperate.
Trump wins
"Trump attacks Clinton for being bisexual!"
TRUMP win
Pence also praised Trump for apologizing following the tape’s disclosure.
In swing states, Trump continues to struggle.
the reason Trump is so dangerous to the establishment is he is unapologetical..
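
Soundbites like those above hint at the surface patterns a sentiment engine must attribute correctly ("Kudos to X for ...", "X is blamed for ..."). Here is a toy pattern matcher, purely illustrative: a real deep-parsing system resolves the syntactic structure rather than matching regexes, and every pattern below is my own simplification.

```python
import re

# Toy illustration only: a few surface patterns over soundbites like the above.
# All patterns are deliberate simplifications, not a real sentiment grammar.
PATTERNS = [
    (re.compile(r'\bkudos to (\w+)', re.I), 'pos'),
    (re.compile(r'\bproud of (?:donald )?(\w+)', re.I), 'pos'),
    (re.compile(r'\b(\w+) is blamed for', re.I), 'neg'),
    (re.compile(r'\bhate (?:donald )?(\w+)', re.I), 'neg'),
]

def attribute(soundbite):
    """Return (target, polarity) for the first matching pattern, else neutral."""
    for pattern, polarity in PATTERNS:
        match = pattern.search(soundbite)
        if match:
            return match.group(1), polarity
    return None, 'neu'

print(attribute("Kudos to Trump for speaking THE TRUTH!"))   # ('Trump', 'pos')
print(attribute("Trump is blamed for inciting violence."))   # ('Trump', 'neg')
```

Soundbites such as "Trump attacks Clinton for being bisexual!" show why bag-of-words fails here: the polarity must attach to the right target, which takes structure, not keywords.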

Here are some public social media soundbites for Clinton in the same period:

Hillary deserves worse than jail.
Congratulations to Hillary & her campaign staff for wining three Presidential ..
I HATE @chicanochamberofcommerce FOR INTRODUCING THAT HILLARY ...
As it turns out, Hillary creeped out a number of people with her grin.
Hillary trumped Trump
Trump won!  Hillary lost
Hillary violated the Special Access Program (SAP) for disclosing about the ...
I trust Flint water more than Hillary
Hillary continued to baffle us with her bovine feces.
NEUROLOGISTS HATE HILLARY FOR USING THIS TRADE SECRET DRUG!!!!...
CONGRATULATIONS TO HILLARY CLINTON FOR WINNING THE PRESIDENCY
Supreme Court: Hillary is our only choice for keeping LGBT rights.
kudos to hillary for remaining sane, I'd have killed him by now
How is he blaming Hillary for sexually assaulting women. He's such a shithead
The only reason I'm voting for Hillary is that Donald is the only other choice
Hillary creeps me out with that weird smirk.
Hillary is annoying asf with all of her laughing
I credit Hillary for the Cubs waking up
When you listen to Hillary talk it is really stupid
On the other hand, Hillary Clinton has a thorough knowledge by virtue of ...
Americans deserve better than Hillary
Certain family members are also upset with me for speaking out against ...
Hillary is hated by all her security detail for being so abusive
Hillary beat trump
The only reason to vote for Hillary is she's a woman.
Certain family members are also upset with me for speaking out against ....
I am glad you seem to be against Hillary as well Joe Pepe.
Hillary scares me with her acions.
Unfortunately Wikileaks is the monster created by Hillary & democrats.
I'm just glad you're down with evil Hillary.
Hillary was not mad at Bill for what he did.  She was mad he got caught.  ......
These stories are falling apart like Hillary on 9/11
Iam so glad he is finally admitting this about Hillary Clinton.
Why hate a man for doing nothing like Hillary Clinton
Hillary molested me with a cigar while Bill watched.
You are upset with Hillary for doing the same as all her predecessors.
I feel like Hillary Clinton is God's punishment on America for its sins.
Trumps beats Hillary
You seem so proud of Hillary for laughing at rape victims.
Of course Putin is going to hate Hillary for publicly announcing false ...
Russia is pissed off at Hillary for blaming the for wikileaks!
Hillary will not win.  Good faith is stronger than evil.  Trump wins??
I am proud of Hillary for standing up for what is good in the USA.
Hillarys plans are worse than Obama
Hillary is the nightmare "the people" have created.
Funny how the Hillary supporters are trashing Trump for saying the same ...
???????????? I am so proud of the USA for making Hillary Clinton president.
Hillary, you're a hoax created by the Chinese
Trump trumps Hillary
During the debate, Trump praised Hillary for having the will to fight.
Trump is better person than Hillary
Donald TRUMPED Hillary
Kudos to Hillary for her accomplishments.
He also praised Hillary for handling the situation with dignity.
During the debate, Trump praised Hillary for having the will to fight.
People like Hillary in senate is the reason this country is going downhill.
Hillary did worse than expectations.
Trump will prosecute Hillary for her crimes, TRUMP will!
Have to praise Hillary for keeping her focus.
a landslide victory for Hillary will restore confidence in American democracy ..
I was so proud of Hillary tonight for acting like a tough, independent woman.
I dislike Hillary Clinton, as I think she is a corrupt, corporate shill.
Hillary did worse than Timmy Kaine
Im so glad he finally brought Benghazi against Hillary
Hillary, thank you for confirmation that the Wikileaks documents are authentic
Supreme Court justices is the only reason why I'd vote for Hillary.
Massive kudos to Hillary for keeping her cool with that beast behind her.
Congrats to Hillary for actually answering the questions. She's spot on. #debate

 

[Related]

Social media mining: Did Trump’s Gettysburg speech enable the support rate to soar as claimed?

Big data mining shows clear social rating decline of Trump last month

Clinton, 5 years ago. How time flies …

Automated Survey

【川普和希拉里的幽默竞赛】

一觉醒来,大周末。看了川普和希拉里最后一场互撕后一起出席慈善晚会。

特朗普在慈善晚宴上自嘲

实拍特朗普希拉里互撕后 酒会竟然互相调侃

玩幽默都不是大师(两人均无法与奥巴马和比尔·克林顿比),但也都及格了,可以给个赞。川普那张扑克牌脸和政治马戏团一般的连番表演,居然可以合乎时宜地来点自嘲和夹枪带棒,希拉里也合乎时宜地富于幽默感似的笑起来,与平时的高高在上一本正经成对比。这幽默喜剧都演得很辛苦,很认真,做政客真心不容易。

想象不出老头儿桑德斯在类似场合是不是也能在社会主义激情后面来点幽默,想破了头,也很难把忧国忧民的桑老与调侃幽默联系起来。

希拉里那一段也不俗,不过好像还没来得及有汉化版,YouTube 在这里(需要翻墙):https://www.youtube.com/watch?v=HjPQ82vTaes

选票已经到手了,除了为了川普而选希拉里外,正在看那些个本地提案和从来就不认识的本地候选人:二号提案试图加消费税(把消费税加到天花板)来改善湾区日渐恶化的交通(包括类似地铁的轻轨在硅谷腹地的延伸),一号提案也是增加房产税来帮助无家可归者提供廉租屋,领导说,local 加税的一律说不(可是联邦选举中,希拉里明摆着要加税,而川普要减税,领导却坚决投希拉里)。C 和 D 提案最切身,就在家门口开门即红透半边天的 Cupertino downtown 旁边,同时在苹果新总部旁边,有一个了无生气的 mall,眼看这块宝地要大热,开发商与民间组织打起来了。开发商要推倒重建,在商业应用建筑之上做一个巨大的空中花园,在沙漠天气里营造一个休闲绿洲来吸引选票。

民间组织宣传 yes on C, no on D;开发商发动广告大战,宣传 yes on D, no on C,针锋相对,煞是热闹。民间组织的宣传颇有效,说 C 是 pro-customer 而 D 是 pro-developer,无奸不商,无商不奸,美丽的空中花园的下面,是多少多少的商业店铺和巨大的利润,带来的是交通和教育问题,等等。总之是说服了领导,但说不服我。
对于一个过了九点就跟鬼城一样的硅谷腹地,缺少的就是人气。D 提案描画的远景就是人气:想想吧,10 年后的苹果总部、新 Downtown 以及空中花园,会是怎样一个聚集人气的所在。为这个,不能与领导保持一致。

D要建公寓楼出租,这是业主反对的主要原因。资本家追求利益最大化,捎带着建个花园,活跃了人气。旁边的那些 property value 将来还要疯涨,举步就是吃喝玩乐,有啥可抱怨的。人再多也比不上北京上海,多开个 school 把马路拓宽不就结了。
等将来有钱了就去买一间这样的公寓,养老甚好,不用开车,楼下就是一切。
当年在温哥华的 Burnaby 的 Metrotown 中心,就建了好多高层公寓,不少老华人就在里面养老,自得其乐,让人羡慕。

 

【大数据舆情挖掘:希拉里川普最近一个月的形象消长】

aaa

大数据舆情挖掘,看图说话。
先看近一个月来在社会媒体上的希拉里和川普的品牌形象对比图:

brand-passion-index-8

三个看点:
1. 川普的 buzz 大过希拉里一倍多,川普是话题中心(圈的大小表明热议度);
2. 普罗对川普比对希拉里,情绪更趋激烈:表现在 Y 轴的 passion intensity 上;
3. 两人总体都不讨人喜欢,川普更加让人厌恶,表现在 X 轴上的 Net Sentiment(也就是褒贬对比的度量)。两人都在冰点之下,社会媒体的形象不佳。

如果我们要自动调查过去一个月时间的趋向和形象消长,可以考虑把数据分割为两段或三段来看此消彼长,先一分为二来看图:

brand-passion-index-7

看到了吧,过去一个月,随着总统大选辩论和丑闻的揭示和宣传,川普的媒体形象显著恶化,表现在舆情圈圈从右(x轴上的右是评价度高 love like,左边是评价度低 hate dislike)向左的位移。本来评价度clearly比希拉里要好,终于比希拉里差了。同时,希拉里的社会媒体形象有所改善,圈圈在从左向右位移。两个人始终都是冰点以下,吐槽多于赞美,但是就在一个月前,还是喜妈更不受待见:不是民众更喜欢老川,而是普罗更厌恶喜妈。

这个品牌对比图示表达了四维信息:
1. net sentiment 评价度 x 轴
2. passion intensity 舆情烈度 y 轴
3. buzz 圈圈的大小,是热议度
4. 一分为二的两个圈是时间的粗线条切割的维度

在二维的图纸上,要表达四维的信息,的确不是很容易。
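
用一个玩具 Python 草图示意这四维信息如何编码成气泡图的数据点(字段名与数值均为笔者的示意假设,非平台的真实 schema):

```python
# 示意草图:BPI 图把四维信息压到二维画布上。
# x = net sentiment,y = passion intensity,气泡面积 = buzz,
# 颜色深浅(shade)= 时间段,越新的时间段颜色越深。

def bpi_points(brand, segments):
    """segments: 按时间先后排列的 (net, intensity, buzz) 三元组列表。"""
    n = len(segments)
    points = []
    for i, (net, intensity, buzz) in enumerate(segments):
        shade = (i + 1) / n          # 时间越近,颜色越深
        points.append({'brand': brand, 'x': net, 'y': intensity,
                       'size': buzz, 'shade': shade})
    return points

# 玩具数据,走势与正文描述一致:喜妈从左向右位移,川普向左跌落
pts = bpi_points('Clinton', [(-0.25, 60, 90), (-0.15, 62, 110)]) \
    + bpi_points('Trump',   [(-0.10, 75, 200), (-0.22, 80, 260)])
print(len(pts))   # 4 个气泡:每人两个时间段
```

这样每个品牌每个时间段就是一个气泡,交给任何散点图工具即可画出上面那种对比图。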

要是嫌第四维时间太粗线条,咱们一分为三看看:

brand-passion-index-6

三个圈,颜色的深浅表达的是时间的远近。当短短的一个月的时间,被一分为三的时候,我们看到了什么趋向呢?我们看到,喜妈的三个圈圈从左下角到右上角呈之字形移动(还是 visualization 设计不到家,不同品牌应该用不同的颜色区分才好):原来喜妈的评价是先好,后坏,最后回到中间。而老川在同一个时间段,是先中,后略好,最后跌入深渊。

以上是利用我们自创的品牌对比图(有美国专利的)来看候选人的形象消长。

社会媒体数据的来源呢?Twitter 为主:

sources-5

sources-4

sources-3

这是一个月来的舆情总结:

%e5%b7%9d%e6%99%ae%e5%b8%8c%e6%8b%89%e9%87%8c

的确是大数据了,一个月的随机的社会媒体数据样本里面,两人的 mentions 就有近两亿,眼球数共计高达3万6千亿。川普占7成,喜妈才三成。川普跟冰冰类似,都是话题之王。

总体社会评价,川普零下20%,喜妈零下18%。
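
顺手验算一下热议度占比的算术(数字只取正文的量级,仅作示意):

```python
# 验算草图:热议度占比(share of voice)的简单算术。
# 正文说约两亿条提及,川普约七成;以下数字按此量级取整,仅作示意。

trump_mentions = 140_000_000
clinton_mentions = 60_000_000
total = trump_mentions + clinton_mentions

trump_sov = trump_mentions / total
print(f"Trump share of voice: {trump_sov:.0%}")   # Trump share of voice: 70%
```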

下面是有关川普的社煤数据选摘:

Bill Clinton disgraced the office with the very behavior you find appalling in Trump.
In closing, yes, maybe Trump does suffer from a severe case of CWS.
Instead, in this alternate NY Times universe, Trump’s campaign was falling apart.
Russian media often praise Trump for his business acumen.
This letter is the reason why Trump is so popular
Trump won
I'm proud of Trump for taking a stand for what's right.
Kudos to Trump for speaking THE TRUTH!
Trump won
I’m glad I’m too tired to write Trump/Putin fuckfic.
#trump won
Trump is the reason Trump will lose this election.
Trump is blamed for inciting violence.
Breaking that system was the reason people wanted Trump.
I hate Donald Trump for ruining my party.
>>32201754 Trump is literally blamed by Clinton supporters for being too friendly with Russia.
Another heated moment came when Trump delivered an aside in reponse to a Clinton one-liner.
@dka_gannongal I think Donald Trump is a hoax created by the Chinese....
Skeptical_Inquirer The drawing makes Trump look too normal.
I'm proud of Donald Trump for answering that honestly!
Donald grossing me out with his mouth features @smerconish @realdonaldtrump
Controlling his sniffles seems to have left Trump extraordinarily exhausted
Trump all the way people trump trump trump
Trump wins
Think that posting crap on BB is making Trump look ridiculous.
I was proud of Trump for making America great again tonight.
MIL is FURIOUS at Trump for betraying her!
@realdonaldTrump Trump Cartel Trump Cartel America is already great, thanks to President Obama.
Kudos to Mr Trump for providing the jobs!!
The main reason to vote for Trump is JOBS!
Yes donal trump has angered many of us with his WORDS.
Trump pissed off a lot of Canadians with his wall comments.
Losing this election will make Trump the biggest loser the world has ever seen.
Billy Bush's career is merely collateral damage caused by Trump's wrenching migration.
So blame Donald for opening that door.
The most important reason I am voting for Trump is Clinton is a crook.
Trump has been criticized for being overly complimentary of Putin.
Kudos to Trump for reaching out to Latinos with some Spanish.
Those statements make Trump's latest moment even creepier.
I'm mad at FBN for parroting the anti-Trump talking points.
Kudos to Trump for ignoring Barack today @realDonaldTrump
Trump has been criticized for being overly complimentary of Putin.
OT How Donald Trump's rhetoric has turned his precious brand toxic via The Independent.
It's these kinds of remarks that make Trump supporters look like incredible idiots.
Trump is blamed for inciting ethnic tensions.
Trump is the only reason the GOP is competitive in this race.
Its why Republicans are furious at Trump for saying the voting process is rigged.
Billy Bush’s career is merely collateral damage caused by Trump’s wrenching migration.
Donald Trump is the dumbest, worst presidential candidate your country has EVER produced.
I am so disappointed in Colby Keller for supporting Trump.
Billy Bush’s career is merely collateral damage caused by Trump’s wrenching migration.
In swing states, Trump continues to struggle.
Trump wins
Co-host Jedediah Bila agreed, saying that the move makes Trump look desperate.
Trump wins
"Trump attacks Clinton for being bisexual!"
TRUMP win
Pence also praised Trump for apologizing following the tape’s disclosure.
In swing states, Trump continues to struggle.
the reason Trump is so dangerous to the establishment is he is unapologetically alpha.

关于希拉里的社会媒体样本数据摘选:

Hillary deserves worse than jail.
Congratulations to Hillary & her campaign staff for wining three Presidential debates.
I HATE @chicanochamberofcommerce FOR INTRODUCING THAT HILLARY GIF INTO MY LIFE
As it turns out, Hillary creeped out a number of people with her grin.
Hillary trumped Trump
Trump won!  Hillary lost
Hillary violated the Special Access Program (SAP) for disclosing about the nuclear weapons!!
I trust Flint water more than Hillary
Hillary continued to baffle us with her bovine feces.
NEUROLOGISTS HATE HILLARY FOR USING THIS TRADE SECRET DRUG!!!!...
CONGRATULATIONS TO HILLARY CLINTON FOR WINNING THE PRESIDENCY
Supreme Court: Hillary is our only choice for keeping LGBT rights.
kudos to hillary for remaining sane, I'd have killed him by now
How is he blaming Hillary for sexually assaulting women. He's such a shithead
The only reason I'm voting for Hillary is that Donald is the only other choice
Hillary creeps me out with that weird smirk.
Hillary is annoying asf with all of her laughing
I credit Hillary for the Cubs waking up
When you listen to Hillary talk it is really stupid
On the other hand, Hillary Clinton has a thorough knowledge by virtue of her tenure as Secretary of State.
Americans deserve better than Hillary
Certain family members are also upset with me for speaking out against Hillary.
Hillary is hated by all her security detail for being so abusive
Hillary beat trump
The only reason to vote for Hillary is she's a woman.
Certain family members are also upset with me for speaking out against Hillary.
I am glad you seem to be against Hillary as well Joe Pepe.
Hillary scares me with her acions.
Unfortunately Wikileaks is the monster created by Hillary & democrats.
I'm just glad you're down with evil Hillary.
Hillary was not mad at Bill for what he did.  She was mad he got caught.  Just like she is not ashamed of what she did she is angry she got caught.
These stories are falling apart like Hillary on 9/11
Iam so glad he is finally admitting this about Hillary Clinton.
Why hate a man for doing nothing like Hillary Clinton
Hillary molested me with a cigar while Bill watched.
You are upset with Hillary for doing the same as all her predecessors.
I feel like Hillary Clinton is God's punishment on America for its sins.
Trumps beats Hillary
You seem so proud of Hillary for laughing at rape victims.
Of course Putin is going to hate Hillary for publicly announcing false accusations.
Russia is pissed off at Hillary for blaming the for wikileaks!
Hillary will not win.  Good faith is stronger than evil.  Trump wins??
I am proud of Hillary for standing up for what is good in the USA.
Hillarys plans are worse than Obama
Hillary is the nightmare "the people" have created.
Funny how the Hillary supporters are trashing Trump for saying the same thing.
???????????? I am so proud of the USA for making Hillary Clinton president.
Hillary, you're a hoax created by the Chinese
Trump trumps Hillary
During the debate, Trump praised Hillary for having the will to fight.
Trump is better person than Hillary
Donald TRUMPED Hillary
Kudos to Hillary for her accomplishments.
He also praised Hillary for handling the situation with dignity.
During the debate, Trump praised Hillary for having the will to fight.
People like Hillary in senate is the reason this country is going downhill.
Hillary did worse than expectations.
Trump will prosecute Hillary for her crimes, TRUMP will!
Have to praise Hillary for keeping her focus.
a landslide victory for Hillary will restore confidence in American democracy vindicated
I was so proud of Hillary tonight for acting like a tough, independent woman.
I dislike Hillary Clinton, as I think she is a corrupt, corporate shill.
Hillary did worse than Timmy Kaine
Im so glad he finally brought Benghazi against Hillary
Hillary, thank you for confirmation that the Wikileaks documents are authentic and you did that tonight when you accused the Russians of hacking your servers!  We the people deserve better than you!
Supreme Court justices is the only reason why I'd vote for Hillary.
Massive kudos to Hillary for keeping her cool with that beast behind her.
Congrats to Hillary for actually answering the questions. She's spot on. #debate

 

【相关】

【关于舆情挖掘】

《朝华午拾》总目录

【语义计算:精灵解语多奇智,不是冤家不上船】

白:
“他分分钟就可以教那些不讲道理的人做人的道理。”

我:

1016a

一路通,直到最后的滑铁卢。
定语从句谓语是“做人”而不是“可以教”,可是定语从句【【可以教。。。的】道理】与 vp定语【【做人的】道理】,这账人是怎么算的?

白:
还记得“那个小集合”吗?sb 教 sb sth,坑已经齐活儿了
“道理”是一般性的,定语是谓词的话一定要隐含全称陈述,不能是所有坑都有萝卜的。当然这也是软性的。只是在比较中不占优而已。单独使用不参与比较就没事:“张三打李四的道理你不懂”就可以,这时意味着“张三打李四背后的逻辑你不懂”。
“他分分钟就可以把一个活人打趴下的道理我实在是琢磨不透。”这似乎可以。

我:
教 至少两个 subcats:
教 sb sth
教 sb todo sth

白:
这个可以有
刚刚看到一个标题起:没有一滴雨会认为自己制造了洪灾。
这个句法关系分析的再清楚,也解释不了标题的语义。

宋:
有意思。

我:
教他
教他做人
教他道理
教他做人的道理
教他的道理
教他做人的往事儿

这个 “道理” 和 “往事”,是属于同一个集合的,我们以前讨论过的那个集合,不参与定语从句成分的 head n。
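
(用一个玩具草图示意这种词典化小集合的用法;集合成员与函数名都是示意性假设,不是真实系统的实现:)

```python
# 玩具草图:一个小集合的 info 类名词(道理、新闻、往事……)
# 做定语从句的 head 时不参与坑的提取,改走“内容”解读。
INFO_NOUNS = {'道理', '新闻', '公告', '往事'}

def attach(modifier_clause, head_noun):
    """返回定语从句与 head 名词的语义关系标签(示意)。"""
    if head_noun in INFO_NOUNS:
        return 'content'       # “做人的道理”:从句是名词的内容,不提取坑
    return 'relativized'       # 普通定语从句:head 名词填从句中的坑

print(attach('做人', '道理'))    # content
print(attach('他教的', '学生'))  # relativized
```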

白:

我:
这个集合里面有子集 是关于 info 的,包括 道理 新闻 公告 往事。。。

白:
但是于“道理”而言,坑不满更显得有抽象度。是没“提取”,但坑不满更顺更优先,因为隐含了全称量词。

我:
就是说 这个集合里面还有 nuances 需要照顾。滑铁卢就在 “教他做人的往事儿” 上,照顾了它 就照顾不了 “做人的道理”。
就事论事 我可以词典化 “做人的道理”,后者有大数据的支持。

白:
这可是能产的语言现象。
试试这个:“你们懂不懂做人要低调的道理?”

我:
我试试 人在外 但电脑带了 只好拍照了

371656522530864097

你们懂不懂道理,这是主干
什么道理?
要低调的道理。
谁要低调?
你们。
懂什么类型的道理?
做人的道理。
谁做人?
你们。
小小的语义计算图谱,能回答这么多问题,这机器是不是有点牛叉?

白:
图上看,“要低调”是“懂道理”的状语而不是“道理”的定语?

我:
这个是对的,by design。但我们设计vn合成词的时候,我们要求把分离词合成起来。如果 n 带有定语,合成以后就指向 合成词整体。这时候 为了留下一些痕迹,有意在系统内部 保留定语的标签,以区别于其他的动词的状语修饰语。否则,“懂【要低调的】道理” 与 “【要低调的】懂道理”,就无法区分了。这样处理 语义落地有好处 完全是系统内部的对这种现象的约定和协调 system internal。定语 状语 都是修饰语 大类无异。

白:
“做人要低调”是一个整体,被拆解了。逻辑似乎不对。
拆解的问题还没解决:不管x是谁,如果x做人,x就要低调。
两个x是受全称量词管辖的同一个约束变元。
@宋 早上您似乎对“没有一滴雨会认为自己制造了洪灾”这个例子有话要说?

宋:
@白硕 主要是觉得这句话的意思有意思。从语义分析看应该不难,因为这是一种模式:没有NP V。即任何x,若x属于NP,则否定V(x)。
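
(宋老师说的这个模式,可以直接写成一行全称否定的语义,玩具示意如下:)

```python
# 玩具草图:“没有 NP V” = 任何 x,若 x ∈ NP,则 ¬V(x)。
def no_np_v(np_set, predicate):
    """返回“没有 NP V”这一全称否定命题的真值。"""
    return all(not predicate(x) for x in np_set)

raindrops = ['drop1', 'drop2', 'drop3']
thinks_caused_flood = lambda x: False   # 没有一滴雨认为自己制造了洪灾
print(no_np_v(raindrops, thinks_caused_flood))   # True
```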

白:
首先这是一个隐喻,雨滴是不会“认为”如何如何的,既然这样用,就要提炼套路,看把雨滴代换成什么:雨滴和洪水的关系,是天上的部分和地上的整体的关系,是无害无责任的个体和有害有责任的整体的关系。

“美国网约车判决给北上广深的启示”

洪:
中土NLP全家福,
烟台开会倾巢出。
语言架桥机辅助,
兵强马壮数据足。

chinanlp
中国nlp全家福啊@wei

白: 哈
李白无暇混贵圈,一擎核弹一拨弦。精灵解语多奇智,不是冤家不上船。

洪:
冤家全都上贼船,李白有事别处赶。天宫迄今无甚关,Alien语言亟需练。

我:
白老师也没去啊 敢情。
黑压压一片 吾道不孤勒。

 

【相关】

【李白对话录:RNN 与语言学算法】

【李白对话录:如何学习和处置“打了一拳”】

【李白对话录:你波你的波,我粒我的粒】

【李白对话录- 从“把手”谈起】

【李白对话录:如何学习和处置“打了一拳”】 

【李白对话录之六:NLP 的Components 及其关系】

乔姆斯基批判

[转载]【白硕 – 穿越乔家大院寻找“毛毛虫”】

泥沙龙笔记:parsing 是引擎的核武器,再论NLP与搜索

泥沙龙笔记:从 sparse data 再论parsing乃是NLP应用的核武器

【立委科普:NLP核武器的奥秘】

【立委科普:语法结构树之美】

【立委科普:语法结构树之美(之二)】

中文处理

Parsing

【置顶:立委NLP博文一览】

《朝华午拾》总目录

 

【泥沙龙笔记:社会财富过个家家?】

我:
名人大嘴,见怪不怪了。
董老师一直在批评李彦宏的忽悠,说什么机器翻译要取代人的翻译。比起下面这个,是小巫见大巫吧, quote:

她想要和小扎一起,着手于“未来100年攻克所有疾病”的伟大理想。

【普莉希拉·陈落泪演讲】今天,扎克伯格和妻子陈宣布在未来10年捐出30亿美元协助疾病研究。陈在演讲中,回忆了自己贫穷的童年——作为一个华裔越南难民的女儿,想不到有一天竟开始有了改变世界的能力。她想要和小扎一起,着手于“未来100年攻克所有疾病”的伟大理想。秒拍视频 @时差视频

声明一下:很钦佩,很感动,为小札和他妻子的赤心。可后者是医学博士啊,不仅仅是攻克所有疾病,而且给了时间期限,起点就是手里的钱和一片赤心。

没人觉得这个有些 way carried away 么?

明天我有钱了,我就宣布 200 年内,破解长生不老的千古之谜,实现秦始皇以来的人类最伟大的生命理想。

Mai:
@wei 金钱万能教
Lots of followers

我:
仔细看那个令人感动落泪的新闻发布会,医学博士也不是白当的,里面提到了一些“科学”。在那些术语的背后就是,医学革命不是不能解决癌症和其他绝症,而是缺乏经费,缺乏合作,缺乏原子弹一样的大项目。

洪:
语不惊人死不休,
有钱都想挣眼球。
伟爷何日高处就,
同样情怀也会有。

我:
现如今,小札和妻子有钱了,可以为这场革命发放启动资金。这么宏伟的目标,而且一两代人就可以完成,值得全世界政府和慈善家持续跟进。他们的角色就是做慈善天使吧?
标题是:Can we cure all diseases in our children's lifetime?
如果我说:no,这是骗人的大忽悠。是不是政治不正确,会被口水骂死?

洪:
追求不朽得不朽,
如此幻觉傻逼有。
凡人嗑药也喝酒,
富豪用钱㤰到头。

我:
更深一层的问题是,这些钱是他们的吗?由得他们胡来吗?

Mai:
和鲁迅先生所说,那个贺喜的客人说“孩子终究会死的”一样不受待见

我:
全世界的社会财富在一个确定的时间点是一个定数(按照我信奉的社会主义理论,社会财富乃全民所有,因为地球只有一个,因为人人生而平等,先不论)。这个财富交给大政府,通过税收,我们还是不放心,那会导致好大喜功的社会主义。所以要求减税,要求小政府,要求市场经济,指望万能的资本主义商品经济中“看不见的手”。但是流落到富豪手中的那一部分,则可变成为做慈善家而来的任性行为。

Mai:
病由心生,想用钱买健康,和始皇帝追求长生,智慧水准相若。

我:
谁来规范和justify这个花费?为什么胡乱或任性的巨额花费可以得到免税的政府扶持和社会的喝彩?

在所有富豪中,小札伉俪其实是我最喜欢的,简直就是孩子一样,童真可爱。可是巨额财富落到孩子手上,简直比落到政府手中更让人惊悚。一样是民脂民膏,很可能就被孩子过家家了。

细思极恐。

社会有反托拉斯法,理想的社会也应该有 反巨额财富spending法 去规范约束暴发户的任性行为。

RW:
@wei "War on Cancer" 是好事啊 。。。伟哥怎么啦?

我:
这个世界是钱太少,好事太多。风投还要有个 due diligence,这么大的 war 谁给做的 due diligence?好事多着呢。

RW:
比如说。。。?

我:
比如说:10x希望工程
100x红十字
1000x教育免费

如今政府难以取信于民了,红十字名声也臭了, 就暴发户花钱做慈善,还没臭,yet

廖:
尼克松搞过一个war on cancer 的项目,最后失败不了了之,浪费了无数老百姓的钱

我:
小札最好把钱给奥巴马或克林顿。专款专用,支持全民健保。不要让这天大的好事流产了 才是正道。唯一的超级大国,一个全民健保都搞不成,还谈什么攻克所有疾病?

100 年后,没有疾病了,这个日子还怎么过?所有的医学院都要关门,医生都要失业。失去疾病的同时,也失去了康复的指望。就如没有了死亡,也只有承受永远的生命之苦,煎熬永无止境,人生永世不得翻身。

四个字:细思极恐。

RW:
@廖 您是高手!
@wei
细思诚可贵,
极恐没必要。
若得长生乐,
两者皆可抛。

廖:
没有了疾病会有新的烦恼,这个大可不必担心。随着社会的发展某个行业逐渐消亡也是常有的事。
李瑞@全球鹰网络科技:
人生八大苦:生、老、病、死、爱别离、怨憎会、求不得、五阴炽盛。
生老病死其实不苦,苦的是,因躁动的心所生出的痴心怨念。
爱却别离,于是忧愁怨恨滋生;
求而不得,于是恩怨情仇牵扯;
于是五阴炽盛:纷扰不断,皆源心乱。

洪:
冰冰年轻圆圆老,
伟哥也已伟爷瞧。
富起幻觉不想翘,
试用钱财打水飘。

Nick:
@wei 你到底哪活的,钱给政府也不行,自己造也不行?都给你做parser?

我:
@Nick  美国有很好的制度,使得暴发户不能变成世袭的贵族,“逼迫”他们把 90%+ 的钱财回馈社会,给他们一个慈善家的虚荣。
可是这个制度有一个重大的缺陷,就是慈善项目的 justification
任何spending,都必须有一个程序,现在是这个程序不到位,从而鼓励了财富任性。"zao" unchecked 也是犯罪。

当然就跟日本五代机一样,钱砸进去了,终归会有科技进步的。最后是 VOI 的问题。

毛:
按伟哥高见,私人如何花钱得要经过公民投票?
或者成立一个国家计委加以统筹?
又见《通往奴役之路》。

我:
对啊 当钱越过一个 thresholds 以后,那钱就不是私人的了。这时候 花钱的权利应该转向社会。任由私人的任性,无论出于多么善良的或虚荣的动因, 都是对人类资源的铺张浪费。就是某种制度缺失造成的合法犯罪。

毛:
这个 threshold 怎么定?

行:
当美貌越个某个阈值是不是应被共妻?
私人财产已经被税收二次调整后就应自主支配,除了危害人类。

毛:
计划经济好?

我:
计划经济也许不好,但私人任性不比计划经济好。计划经济下 还可以有个制度性监管的空间。私人任性连这个空间都没有。

毛:
哦,那应该公私合营,社会主义改造,二次土改?

我:
小札的一百亿也许是任性,但也是唤醒

毛:
行,你是计划经济派

南:
不犯法即可

我:
税收是一个手段,但还是止不住任性挥霍

行:
按伟爷的理,您的财富远远超过全球的平均,是不是象那个老毛在湖南农考号召的,可以您家来搬东西?

我:
挥霍的背后就是不平等。

毛:
机会平等还是结果平等?

行:
全球还有几亿人赤贫饥饿,您经常晒美食算不算挥霍?是不是赞成穆加贝大爷把农场分给老兵

毛:
最不虚伪的就是把你的钱交公

我:
在资源总量恒定的情况下,一个项目的任性意味着其他项目的被剥夺。
每个项目后面都是人命。救了张三救不了李四。这个救谁 救多少的决定,无论如何不该是任性的私人决策。本质上与独裁者的长官意志,形象工程 ,并无二致。

行:
我坚定地站在这位老毛一边,坚决反对任何通往奴役的道路。

毛:
你的项目后面也有人命?

行:
伟爷,您的美食后面也是人命。
无论如何都该是任性的私人决策!
独裁者是剥夺。明抢!

我:
行者 我懂得你背后的逻辑,都是那个背景出来的。

毛:
好吧,说是社科院要重启计划经济研究,伟哥大有用武之地。

我:
你的通向奴役的说法,混淆了度的概念。
如果是几个 million,或几十个 millions,fine,任性就任性。
如果是几百个亿 就不是一回事儿了

毛:
这些理论,我们从列宁那里听多了。

行:
当二次分配后的私人财富任由伟爷般的公意支配后,美国会变成天堂般的朝鲜

我:
这并不是说小扎这笔挥霍一定不对,也许歪打正着,也是可能的。 但正常理性的社会是不允许这样的。

行:
这个度站在津巴布韦老兵,站在陕北土坡的二流子来看呢?

毛:
为什么他会有几百个亿?
好吧,这个题目太大了,伟哥你自说自话吧。

我:
为什么有几百个亿?这是好问题。
因为他绝顶聪明,凭空创造出来的?
骗鬼吧。

行:
我们不怕因为有钱而任性而有权,我们怕因为任性的权力而有钱!

我:
他要是在月球创造了几百几千亿财富,爱咋玩咋玩。
他在地球赚钱,就得受到地球和地球人的束缚。

我:
共产主义破产。但共产主义与独裁计划经济的破产,并不自动为现存制度背书。

毛:
需要公民投票的不是他如何花钱,而是你的这些主张。好吧,stop。

行:
@wei 只是恐惧这个逻辑。
我觉得可以建议,呼吁。但权力仍归小札。

缘:
问题是他每次都出卖自己,把自己卖出一个好价格,交易自己。制度保证自由出卖自己。

我:
行者 我们讨论的是不同层面的问题。

行:
你推崇的集权就可以是2000亿造加速器但还在希望工程。

我:
最后一句 假如不是几百亿,而是再高几个量级呢?

行:
有钱就是可以任性。咱有钱 了买两碗豆浆,吃一碗倒一碗

我:
咱也任性 晒晒今天的地球恩赐:秋夜喜雨 秋日喜晴。

南:
应该检讨财富的再分配模式而不是侵害个人权利

我:
@南 对。现存的是合法的 不合理怎么办 再检讨修正。并不意味着一检讨就只有回到共产主义一途。

 

From IBM's Jeopardy robot, Apple's Siri, to the new Google Translate

Latest Headline News: Samsung acquires Viv, a next-gen AI assistant built by the creators of Apple's Siri.

Wei:
Some people are just smart, or shrewd, beyond what we can imagine.  I am talking about the fathers of Siri, who have been so successful with their technology that they managed to sell the same type of technology twice, both at astronomical prices, and both to giants in the mobile and IT industry.  What is more amazing is that the companies they sold their tech assets to are direct competitors.  How did that happen?  How "nice" this world is to a really, really smart technologist with a sharp business mind.

What is more stunning is the fact that Siri and the like are so far regarded more as toys than as must-have tools, intended, at least for now, to satisfy curiosity more than to meet a hard demand of the market.  Most surprising of all, the technology behind Siri is not unreachable rocket science by nature; similar technology and a similar level of performance are starting to surface from numerous teams and companies, big and small.

I am a tech guy myself, loving gadgets and always watching for new technology breakthroughs.  To my mind, some things in the world are simply amazing, leaving us in awe: for example, the wonder of smartphones when the iPhone first came out. But some other things in the tech world do not make us admire or wonder that much, although they may have left a deep footprint in history. For example, the question-answering machine made by IBM's Watson Lab that won Jeopardy: it made it into the computer history exhibition as a major AI milestone.  More recently, the iPhone's Siri, which Apple managed to put into the hands of millions of people for the first time for seemingly live man-machine interaction. Beyond those accomplishments, there is no magic or miracle that surprises me.  I have the feeling of "seeing through" these tools, both the IBM answering robot type depending on big data and Apple's intelligent agent Siri depending on domain apps (plus a flavor of AI chatbot tricks).

Chek: @ Wei I bet the experts in rocket technology will not be impressed that much by SpaceX either.

Wei: Right, this is because we are in the same field, what appears magical to the outside world can hardly win an insider's heart, who might think that given a chance, they could do the same trick or better.

The Watson answering system can well be regarded as a milestone in engineering for massive, parallel big data processing, not striking us as an AI breakthrough. What shines as an engineering accomplishment is that all this happened before the big data age, when the infrastructures for indexing, storing and retrieving big data in the cloud were not yet widely adopted.  In this regard, IBM was indeed the first to run ahead of the trend, with the ability to put a farm of servers to work for a QA engine deployed onto massive data.  But from a true AI perspective, neither the Watson robot nor the Siri assistant can be compared with the more recent launch of the new Google Translate based on neural networks.  So far I have tested this monster by having it translate three Chinese blogs of mine (including this one in the making), and I have to say that I have been blown away by what I saw.  As a seasoned NLP practitioner who started MT training 30 years ago, I am still in disbelief before this wonder of a technology showcase.

Chen: wow, how so?

Wei:  What can I say?  It has exceeded the limits of my imagination for all my dreams of what MT can be and should be since I entered this field many years ago.  While testing, I only needed to do limited post-editing to make the following Chinese blogs of mine presentable and readable in English, a language with no kinship whatsoever with the source language, Chinese.

Question answering of the past and present

Introduction to NLP Architecture

Hong: Wei seems frightened by his own shadow.

Chen:  The effect is that impressive?

Wei:  Yes. Before the deep neural age, I also tested and tried to use SMT for the same job, having tried both Google Translate and Baidu MT; there is just no comparison with this new launch based on the technology breakthrough.  If you hit their sweet spot, that is, if the data you need translated are close to the data they have trained the system on, Google Translate can save you at least 80% of the manual work.  80% of the time, it comes out so smooth that there is hardly a need for post-editing.  There are errors or crazy renderings in less than 20% of the output, but who cares?  I can focus on that part and get my work done way more efficiently than before.  The most important thing is that SMT before deep learning rendered a text hardly readable no matter how good a temper I have; it was unbearable to work with.  Now, with this breakthrough in training the model on sentences instead of words and phrases, the translation magically sounds fairly fluent.

It is said that they are good at the news genre and at IT and technology articles, for which they have abundant training data.  The legal domain is said to be good too.  Other domains, such as spoken language, online chats and literary works, remain a challenge, as there does not seem to be sufficient data available yet.

Chen: Yes, it all depends on how large and good the bilingual corpora are.

Wei:  That is true.  SMT stands on the shoulders of thousands of professional translators and their works.  An ordinary individual's head simply has no way of digesting this much linguistic and translation knowledge to compete with a machine in efficiency and consistency, and eventually in quality as well.

Chen: Google's major contribution is to explore and exploit the huge body of existing human knowledge, including in search, where anchor text is the core.

Ma: I very much admire IBM's Watson, and I would not dare to think it possible to make such an answering robot back in 2007.

Wei: But the underlying algorithm does not strike me as a breakthrough. They were lucky in targeting the mass-media Jeopardy TV show to hit the world.  The Jeopardy quiz, in essence, pushes the human brain's memory to its extreme; it is largely a memorization test, not a true intelligence test by nature.  In memorization, a human has no way of competing with a machine, not even close.  The vast majority of quiz questions are so-called factoid questions in the QA area, asking about things like who did what, when and where: a very tractable task.  Factoid QA depends mainly on Named Entity technology, which matured long ago, coupled with the tractable task of question parsing for identifying the asking point, and the backend support from IR, a well-studied and well-practised area for over two decades now.  Another benefit in this task is that most knowledge questions asked in the test involve standard answers with huge redundancy in the text archive, expressed in a variety of ways, some of which are bound to correspond closely to the way the question is asked.  All these factors contributed to IBM's huge success and its almost mesmerizing performance in the historical event.  The bottom line is, shortly after open-domain QA was officially born in 1999 with the first TREC QA track, the core-engine technology was researched and verified for factoid questions given a large corpus as a knowledge source. The rest is just how to operate such a project on a big engineering platform and how to fine-tune it to adapt to the Jeopardy-style scenario for best effect in the competition.  Really no magic whatsoever.
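The factoid-QA recipe just described (parse the question for its asking point, then look for a matching entity type in retrieved text) can be sketched in a few lines. This is a toy illustration of the idea, not Watson's actual method; the wh-word patterns and type labels below are made up for the example:

```python
import re

# Toy asking-point identifier for factoid questions (illustrative only;
# real systems use full question parsing, not just wh-word patterns).
ASKING_POINT_RULES = [
    (r"^who\b",                "PERSON"),
    (r"^where\b",              "LOCATION"),
    (r"^when\b|\bwhat year\b", "TIME"),
    (r"^how (many|much)\b",    "QUANTITY"),
]

def asking_point(question: str) -> str:
    q = question.lower().strip()
    for pattern, entity_type in ASKING_POINT_RULES:
        if re.search(pattern, q):
            return entity_type
    return "UNKNOWN"  # non-factoid: what-is / how / why need other strategies

print(asking_point("Who was the first American in space?"))  # PERSON
print(asking_point("In what year did Joe DiMaggio compile his 56-game hitting streak?"))  # TIME
```

Once the asking point is known, the engine only has to find a named entity of that type in a retrieved passage, which is why factoid QA was tractable with 1999-era technology.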

Google Translated from【泥沙龙笔记:从三星购买Siri之父的二次创业技术谈起】, with post-editing by the author himself.

 

【Related】

Question answering of the past and present

Introduction to NLP Architecture

Newest GNMT: time to witness the miracle of Google Translate

Dr Li’s NLP Blog in English

 

【泥沙龙笔记:从三星购买Siri之父的二次创业技术谈起】

最近新闻:【三星收购 VIV 超级智能平台,与 Siri 和 Google 展开智能助理三国杀】

我:
人要是精明,真是没治。一个 Siri,可以卖两次,而且都是天价,都是巨头,并且买家还是对头,也是奇了。最奇的是,Siri 迄今还是做玩具多于实用,满足好奇心多于满足市场的刚性需求。最最奇的是,Siri 里面的奥妙并不艰深,有类似水平和技术的也不是就他一家。
世界上有些事儿是让人惊叹的,譬如当 iPhone 问世的时候。但有些事儿动静很大,也在历史上留下了很深的足迹,但却没有叹服的感受。譬如 IBM 花生的问答系统,NND,都进入计算机历史展览馆了,作为AI里程碑。再如 Siri,第一个把人机对话送到千家万户的手掌心,功不可没。但这两样,都不让人惊叹,因为感觉上都是可以“看穿”的东西。不似火箭技术那种,让人有膜拜的冲动。IBM 那套我一直认为是工程的里程碑,是大数据计算和operations的成就,并非算法的突破。

查:
@wei 呵呵 估计搞火箭的也看不上SpaceX

我: 那倒也是,内行相轻,自古而然,因为彼此都多少知底。

陈:
最近对Watson很感冒

我:
花生是在大数据架构热起来之前做成的。从这方面看,IBM 的确开风气之先,有能力把一个感觉上平平的核心引擎,大规模部署到海量数据和平行计算上。总之,这两样都不如最近测试谷歌MT给我的震撼大。谷歌的“神经”翻译,神经得出乎意表,把我这个30年前就学MT的老江湖也弄晕糊了,云里雾里,不得不给他们吹一次喇叭

陈: 咋讲

我:
还讲啥,我是亲手测试的。两天里面测试翻译了我自己的两篇博文:

【Question answering of the past and present】

Introduction to NLP Architecture

洪:
伟爷被自己的影子吓坏了。

陈:
效果奇好?

我:
是的。前神经时代我也测试过,心里是有比较的。天壤之别。
如果你撞上了他们的枪口,数据与他们训练的接近,谷歌MT可以节省你至少 80% 的翻译人工。80% 的时候几乎可以不加编辑,就很顺畅了。谁在乎 20% 以内的错误或其他呢,反正我是省力一多半了。最重要的是,以前用 MT,根本就不堪卒读,无论你多好的脾气。现在一神经,就顺溜多了。当然,我的 NLP 博文,也正好撞上了他们的枪口。

陈:
以后也可以parsing。试一些医学的

我:
据说,他们擅长 news,IT,technology,好像 法律文体 据说也不错。其他领域、口语、文学作品等,那就太难为它了。

陈:
有双语语料

我:
就是,它是站在千万个专业翻译的智慧结晶上。人的小小的脑袋怎么跟它比拼时间和效率呢,拼得了初一,也熬不过十五。

陈:
谷歌的重大贡献是发掘人类已经存在的知识。包括搜索,锚文本是核心.

马:
我挺佩服IBM的华生的,如果是我,绝不敢在2007年觉得能做出这么一个东西出来

我:
可是算法上看真的不需要什么高超。那个智力竞赛是唬人的,挑战人的记忆极限,对于机器是特别有利的。绝大多数智力竞赛问答题,都是所谓 factoid questions
主要用到的是早已成熟的 Named Entity 技术,加上 question 的有限 parsing,背后的支撑也就是 IR。恰好智力竞赛的知识性问题又是典型的大数据里面具有相当 redundancy 的信息。这种种给IBM创造了成功的条件。

1999 年开始 open domain QA 正式诞生,不久上面的技术从核心引擎角度就已经被验证。剩下的就是工程的运作和针对这个竞赛的打磨了。

 

【相关】

【问答系统的前生今世】

【Question answering of the past and present】

谷歌NMT,见证奇迹的时刻

Newest GNMT: time to witness the miracle of Google Translate

《新智元笔记:知识图谱和问答系统:开题(1)》 

《新智元笔记:知识图谱和问答系统:how-question QA(2)》 

【置顶:立委NLP博文】

 

【问答系统的前生今世】

立委按:自从 Siri 第一次把问答系统送到千万人的手掌心后,如今又出了微软小冰和小娜。其实,中外所有IT巨头都在这方面加大了投入。于是想到重发2011年的博文。

一 前生
传统的问答系统是人工智能(AI: Artificial Intelligence)领域的一个应用,通常局限于一个非常狭窄专门的领域,基本上是由人工编制的知识库加上一个自然语言接口而成。由于领域狭窄,词汇总量很有限,其语言和语用的歧义问题可以得到有效的控制。问题是可以预测的,甚至是封闭的集合,合成相应的答案自然有律可循。著名的项目有上个世纪60年代研制的LUNAR系统,专事回答有关阿波罗登月返回的月球岩石样本的地质分析问题。SHRDLU 是另一个基于人工智能的专家系统,模拟的是机器人在玩具积木世界中的操作,机器人可以回答这个玩具世界的几何状态的问题,并听从语言指令进行合法操作。
这些早期的AI探索看上去很精巧,揭示了一个有如科学幻想的童话世界,启发人的想象力和好奇心,但是本质上这些都是局限于实验室的玩具系统(toy systems),完全没有实用的可能和产业价值。随着作为领域的人工智能之路越走越窄(部分专家系统虽然达到了实用,基于常识和知识推理的系统则举步维艰),寄生其上的问答系统也基本无疾而终。倒是有一些机器与人的对话交互系统 (chatterbots)一路发展下来至今,成为孩子们的网上玩具(我的女儿就很喜欢上网找机器人对话,有时故意问一些刁钻古怪的问题,程序应答对路的时候,就夸奖它一句,但更多的时候是看着机器人出丑而哈哈大笑。不过,我个人相信这个路子还大有潜力可挖,把语言学与心理学知识交融,应该可以编制出质量不错的机器人心理治疗师。其实在当今的高节奏高竞争的时代,很多人面对压力需要舒缓,很多时候只是需要一个忠实的倾听者,这样的系统可以帮助满足这个社会需求。要紧的是要消除使用者“对牛弹琴”的先入为主的偏见,或者设法巧妙隐瞒机器人的身份,使得对话可以敞开心扉。扯远了,打住。)
二 重生
产业意义上的开放式问答系统完全是另一条路子,它是随着互联网的发展以及搜索引擎的普及应运而生的。准确地说,开放式问答系统诞生于1999年,那一年搜索业界的第八届年会(TREC-8:Text REtrieval Conference)决定增加一个问答系统的竞赛,美国国防部有名的DARPA项目资助,由美国国家标准局组织实施,从而催生了这一新兴的问答系统及其community。问答系统竞赛的广告词写得非常精彩,恰到好处地指出搜索引擎的不足,确立了问答系统在搜索领域的价值定位。记得是这样写的(大体):用户有问题,他们需要答案。搜索引擎声称自己做的是信息检索(information retrieval),其实检索出来的并不是所求信息,而只是成千上万相关文件的链接(URLs),答案可能在也可能不在这些文件中。无论如何,总是要求人去阅读这些文件,才能寻得答案。问答系统正是要解决这个信息搜索的关键问题。对于问答系统,输入的是问题,输出的是答案,就是这么简单。
说到这里,有必要先介绍一下开放式问答系统诞生时候的学界与业界的背景。
从学界看,传统意义上的人工智能已经不再流行,代之而来的是大规模真实语料库基础上的机器学习和统计研究。语言学意义上的规则系统仍在自然语言领域发挥作用,作为机器学习的补充,而纯粹基于知识和推理的所谓智能规则系统基本被学界抛弃(除了少数学者的执着,譬如Douglas Lenat 的 Cyc)。学界在开放式问答系统诞生之前还有一个非常重要的发展,就是信息抽取(Information Extraction)专业方向及其community的发展壮大。与传统的自然语言理解(Natural Language Understanding)面对整个语言的海洋,试图分析每个语句求其语义不同,信息抽取是任务制导,任务之外的语义没有抽取的必要和价值:每个任务定义为一个预先设定的所求信息的表格,譬如,会议这个事件的表格需要填写会议主题、时间、地点、参加者等信息,类似于测试学生阅读理解的填空题。这样的任务制导的思路一下子缩短了语言技术与实用的距离,使得研究人员可以集中精力按照任务指向来优化系统,而不是从前那样面面俱到,试图一口吞下语言这个大象。到1999年,信息抽取的竞赛及其研讨会已经举行了七届(MUC-7:Message Understanding Conference),也是美国DARPA项目的资助产物(如果说DARPA引领了美国信息产业研究及其实用化的潮流,一点儿也不过誉),这个领域的任务、方法与局限也比较清晰了。发展得最成熟的信息抽取技术是所谓实体名词的自动标注(Named Entity:NE tagging),包括人名、地名、机构名、时间、百分比等等。其中优秀的系统无论是使用机器学习的方法,还是编制语言规则的方法,其查准率查全率的综合指标都已高达90%左右,接近于人工标注的质量。这一先行的年轻领域的技术进步为新一代问答系统的起步和开门红起到了关键的作用。
到1999年,从产业来看,搜索引擎随着互联网的普及而长足发展,根据关键词匹配以及页面链接为基础的搜索算法基本成熟定型,除非有方法学上的革命,关键词检索领域该探索的方方面面已经差不多到头了。由于信息爆炸时代对于搜索技术的期望永无止境,搜索业界对关键词以外的新技术的呼声日高。用户对粗疏的搜索结果越来越不满意,社会需求要求搜索结果的细化(more granular results),至少要以段落为单位(snippet)代替文章(URL)为单位,最好是直接给出答案,不要拖泥带水。虽然直接给出答案需要等待问答系统的研究成果,但是从全文检索细化到段落检索的工作已经在产业界实行,搜索的常规结果正从简单的网页链接进化到 highlight 了搜索关键词的一个个段落。
新式问答系统的研究就在这样一种业界急切呼唤、学界奠定了一定基础的形势下,走上历史舞台。美国标准局的测试要求系统就每一个问题给出最佳的答案,有短答案(不超过50字节)与长答案(不超过250字节)两种。下面是第一次问答竞赛的试题样品:
Who was the first American in space?
Where is the Taj Mahal?
In what year did Joe DiMaggio compile his 56-game hitting streak?
三 昙花
这次问答系统竞赛的结果与意义如何呢?应该说是结果良好,意义重大。最好的系统达到60%多的正确率,就是说每三个问题,系统可以从语言文档中大海捞针一样搜寻出两个正确答案。作为学界开放式系统的第一次尝试,这是非常令人鼓舞的结果。当时正是 dot com 的鼎盛时期,IT 业界渴望把学界的这一最新研究转移到信息产品中,实现搜索的革命性转变。里面有很多有趣的故事,参见我的相关博文:《朝华午拾:创业之路》
回顾当年的工作,可以发现是组织者、学界和业界的天时地利促成了问答系统奇迹般的立竿见影的效果。美国标准局在设计问题的时候,强调的是自然语言的问题(English questions,见上),而不是简单的关键词 queries,其结果是这些问句偏长,非常适合做段落检索。为了保证每个问题都有答案,他们议定问题的时候针对语言资料库做了筛选。这样一来,文句与文本必然有相似的语句对应,客观上使得段落匹配(乃至语句匹配)命中率高(其实,只要是海量文本,相似的语句一定会出现)。设想如果只是一两个关键词,寻找相关的可能含有答案的段落和语句就困难许多。当然找到对应的段落或语句,只是大大缩小了寻找答案的范围,不过是问答系统的第一步,要真正锁定答案,还需要进一步细化,pinpoint 到语句中那个作为答案的词或词组。这时候,信息抽取学界已经成熟的实名标注技术正好顶上来。为了力求问答系统竞赛的客观性,组织者有意选择那些答案比较单纯的问题,譬如人名、时间、地点等。这恰好对应了实名标注的对象,使得先行一步的这项技术有了施展身手之地。譬如对于问题 “In what year did Joe DiMaggio compile his 56-game hitting streak?”,段落语句搜索很容易找到类似下列的文本语句:Joe DiMaggio's 56 game hitting streak was between May 15, 1941 and July 16, 1941.  实名标注系统也很容易锁定 1941 这个时间单位。An exact answer to the exact question,答案就这样在海量文档中被搜得,好像大海捞针一般神奇。沿着这个路子,11 年后的 IBM 花生研究中心成功地研制出打败人脑的电脑问答系统,获得了电视智能大奖赛 Jeopardy! 的冠军(见报道 COMPUTER CRUSHES HUMAN 'JEOPARDY!' CHAMPS),在全美观众面前大大地出了一次风头,有如当年电脑程序第一次赢得棋赛冠军那样激动人心。
当年成绩较好的问答系统,都不约而同地结合了实名标注与段落搜索的技术: 证明了只要有海量文档,snippet+NE 技术可以自动搜寻回答简单的问题。
四 现状
1999 年的学界在问答系统上初战告捷,我们作为成功者也风光一时,下自成蹊,业界风险投资商蜂拥而至。很快拿到了华尔街千万美元的风险资金,当时的感觉真地好像是在开创工业革命的新纪元。可惜好景不长,互联网泡沫破灭,IT 产业跌入了萧条的深渊,久久不能恢复。投资商急功近利,收紧银根,问答系统也从业界的宠儿变成了弃儿(见《朝华午拾 - 水牛风云》)。主流业界没人看好这项技术,比起传统的关键词索引和搜索,问答系统显得不稳定、太脆弱(not robust),也很难 scale up, 业界的重点从深度转向广度,集中精力增加索引涵盖面,包括所谓 deep web。问答系统的研制从业界几乎绝迹,但是这一新兴领域却在学界发芽生根,不断发展着,成为自然语言研究的一个重要分支。IBM 后来也解决了 scale up (用成百上千机器做分布式并行处理)和适应性培训的问题,为赢得大奖赛做好了技术准备。同时,学界也开始总结问答系统的各种类型。一种常见的分类是根据问题的种类。
我们很多人都在中学语文课上,听老师强调过阅读理解要抓住几个WH的重要性:who/what/when/where/how/why(Who did what when, where, how and why?).  抓住了这些WH,也就抓住了文章的中心内容。作为对人的阅读理解的仿真,设计问答系统也正是为了回答这些WH的问题。值得注意的是,这些 WH 问题有难有易,大体可以分成两类:有些WH对应的是实体专名,譬如 who/when/where,回答这类问题相对容易,技术已经成熟。另一类问题则不然,譬如what/how/why,回答这样的问题是对问答学界的挑战。简单介绍一下这三大难题如下。
What is X?类型的问题是所谓定义问题,譬如 What is iPad II? (也包括作为定义的who:Who is Bill Clinton?) 。这一类问题的特点是问题短小,除去问题词What与联系词 is 以外 (搜索界叫stop words,搜索前应该滤去的,问答系统在搜索前利用它理解问题的类型),只有一个 X 作为输入,非常不利于传统的关键词检索。回答这类问题最低的要求是一个有外延和种属的定义语句(而不是一个词或词组)。由于任何人或物体都是处在与其他实体的多重关系之中(还记得么,马克思说人是社会关系的总和),要想真正了解这个实体,比较完美地回答这个问题,一个简单的定义是不够的,最好要把这个实体的所有关键信息集中起来,给出一个全方位的总结(就好比是人的履历表与公司的简介一样),才可以说是真正回答了 What/Who is X 的问题。显然,做到这一步不容易,传统的关键词搜索完全无能为力,倒是深度信息抽取可以帮助达到这个目标,要把散落在文档各处的所有关键信息抽取出来,加以整合才有希望(【立委科普:信息抽取】)。
How 类型的问题也不好回答,它搜寻的是解决方案。同一个问题,往往有多种解决档案,譬如治疗一个疾病,可以用各类药品,也可以用其他疗法。因此,比较完美地回答这个 How 类型的问题也就成为问答界公认的难题之一。

Why 类型的问题,是要寻找一个现象的缘由或动机。这些原因有显性表达,更多的则是隐性表达,而且几乎所有的原因都不是简单的词或短语可以表达清楚的,找到这些答案,并以合适的方式整合给用户,自然是一个很大的难题。

可以一提的是,我来硅谷九年帮助设计开发 deploy 了两个产品,第一个产品的本质就是回答 How-question 的,第二个涉及舆情挖掘和回答舆情背后的 Why-question。问答系统的两个最大的难题可以认为被我们的深层分析技术解决了。

原文在:【立委科普:问答系统的前生今世】

【相关】

【Question answering of the past and present】

http://en.wikipedia.org/wiki/Question_answering

《新智元笔记:知识图谱和问答系统:开题(1)》 

《新智元笔记:知识图谱和问答系统:how-question QA(2)》 

【旧文翻新:金点子起家的老管家 Jeeves】

《新智元笔记:微软小冰,人工智能聊天伙伴(1)》 

《新智元笔记:微软小冰,可能的商业模式(2)》 

【立委科普:从产业角度说说NLP这个行当】

【Question answering of the past and present】

1. A pre-existence

The traditional question answering (QA) system is an application of Artificial Intelligence (AI).  It is usually confined to a very narrow, specialized domain, basically made up of a hand-crafted knowledge base with a natural language interface. As the domain is narrow, the vocabulary is very limited, and lexical and pragmatic ambiguity can be kept effectively under control. Questions are highly predictable, or close to a closed set, and the rules for composing the corresponding answers are fairly straightforward. Well-known projects of the 1960s include LUNAR, a QA system specializing in answering questions about the geological analysis of the lunar rock samples collected during the Apollo Moon landings.  SHRDLU is another famous expert system in AI history; it simulates the operation of a robot in a toy blocks world, where the robot can answer questions about the geometric state of the blocks and follow language instructions to perform legal operations.

These early AI explorations seemed promising, revealing a fairy-tale world of scientific fantasy and greatly stimulating our curiosity and imagination. Nevertheless, in essence these were just toy systems, confined to the laboratory and of little practical value. As the field of artificial intelligence got narrower and narrower (although some expert systems reached a practical level, the majority of AI work based on common sense and knowledge reasoning could hardly get beyond the lab), the QA systems built on it failed to render meaningful results. Some conversational systems (chatterbots) did develop along their own path and became children's popular online toys (I remember a time when my daughter was young and very fond of surfing the Internet to find various chatbots, sometimes deliberately asking tricky questions for fun.  Recent years have seen a revival of this tradition by the industrial giants, with some of its flavor seen in Siri and much more in Microsoft's Little Ice).

2. Rebirth

Industrial open-domain QA systems are another story; they came into existence with the Internet boom and the popularity of search engines. Specifically, open-domain QA was born in 1999, when TREC-8 (the Eighth Text REtrieval Conference) decided to add a natural language QA track to its competition, funded by the US Department of Defense's DARPA program and administered by the National Institute of Standards and Technology (NIST), thus giving birth to this emerging QA community.  The opening remarks calling for participation in the competition were very impressive, to this effect:

Users have questions; they need answers. Search engines claim that they are doing information retrieval, yet what they retrieve is not the requested information but links to thousands of possibly relevant documents. The answers may or may not be in those documents. In any case, people are compelled to read the documents in order to find the answers. A QA system in our vision solves this key problem of information need. For QA, the input is a natural language question and the output is the answer; it is that simple.

It seems of benefit to introduce some background on academia as well as industry at the time open-domain QA was born.

From the academic point of view, artificial intelligence in the traditional sense was no longer popular, replaced by machine learning and statistical research based on large-scale real-world corpora. Linguistic rule systems still played a role in the natural language field, but only as a complement to mainstream machine learning, while the so-called intelligent systems based purely on knowledge and common-sense reasoning were largely put on hold by academia (except by a few, such as Dr. Douglas Lenat with his Cyc). Before the birth of open-domain question answering there had been another very important development in the academic community, namely the birth and growth of a new area called Information Extraction (IE), again a child of DARPA. Traditional natural language understanding (NLU) faces the entire ocean of language, trying to analyze each sentence in pursuit of a complete semantic representation of all its parts. IE is different: it is task-driven, aiming only at a defined target of information and leaving the rest aside.  For example, the IE template for a conference event may be defined to fill in the conference [name], [time], [location], [sponsors], [registration] and the like, very similar to filling in the blanks of a student's reading comprehension test. This idea of task-driven semantics shortened the distance between language technology and practical use, allowing researchers to focus on optimizing systems for the task at hand, rather than trying to swallow the language monster in one bite as before. By 1999, the IE community had held seven annual competitions (MUC-7: the Seventh Message Understanding Conference), and the tasks, approaches and then-current limitations of the area were all relatively clear. The most mature part of information extraction technology was the so-called Named Entity (NE) tagging, covering names of people, locations and organizations as well as time, percentage and the like. The state-of-the-art systems, whether based on machine learning or hand-crafted rules, reached a combined precision-recall score (F-measure) of 90+%, close to the quality of human performance. This first-of-its-kind technological advancement in a young field turned out to play a key role in the new generation of open-domain QA.
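The F-measure cited above is computed over the entity mentions a system proposes versus those in the gold annotation. A minimal sketch of that evaluation follows; the gold and predicted sets are invented for illustration:

```python
def f_measure(true_entities: set, predicted_entities: set) -> dict:
    """Precision/recall/F1 over (span, type) pairs, the usual NE-tagging metric."""
    tp = len(true_entities & predicted_entities)  # true positives
    precision = tp / len(predicted_entities) if predicted_entities else 0.0
    recall = tp / len(true_entities) if true_entities else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical gold annotation vs. system output for one sentence.
gold = {("Joe DiMaggio", "PERSON"), ("May 15, 1941", "TIME"), ("July 16, 1941", "TIME")}
pred = {("Joe DiMaggio", "PERSON"), ("May 15, 1941", "TIME"), ("1941", "TIME")}
print(f_measure(gold, pred))  # precision = recall = f1 = 2/3 here
```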

In industry, by 1999 search engines had grown rapidly with the popularity of the Internet, and search algorithms based on keyword matching and page ranking were quite mature. Short of a methodological revolution, the keyword search field seemed to have almost reached its limit. There was an increasing call for going beyond basic keyword search. Users were dissatisfied with search results in the form of document links; they needed more granular results, at least in paragraphs (snippets) instead of URLs, and preferably direct short answers to the questions in mind.  Although the direct answer was still a dream awaiting the open-domain QA era, full-text search was moving from returning simple document URLs to paragraph retrieval as a common practice in the industry, with results changing from plain links to web pages into snippets with the search keywords highlighted.
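The paragraph (snippet) retrieval described here can be approximated by scoring passages on content-word overlap with the question. A bare-bones sketch, with a made-up stop-word list standing in for real query processing, stemming and indexing:

```python
# Bare-bones snippet retriever: rank passages by overlap with the
# question's content words. Real engines add indexing, stemming, ranking.
STOP_WORDS = {"the", "a", "an", "is", "was", "in", "of", "did", "what", "who",
              "where", "when", "how", "why", "his", "her", "to"}

def content_words(text: str) -> set:
    return {w.strip("?.,!\"'").lower() for w in text.split()} - STOP_WORDS - {""}

def best_snippet(question: str, passages: list) -> str:
    q = content_words(question)
    return max(passages, key=lambda p: len(q & content_words(p)))

passages = [
    "The Taj Mahal is located in Agra, India.",
    "Joe DiMaggio's 56 game hitting streak was between May 15, 1941 and July 16, 1941.",
]
q = "In what year did Joe DiMaggio compile his 56-game hitting streak?"
print(best_snippet(q, passages))  # picks the DiMaggio passage
```

The point made in the text holds even in this toy: the longer and more natural the question, the more content words it shares with the answer-bearing passage, so paragraph retrieval does most of the work before any answer pinpointing.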

In such a favorable environment in industry and academia, open-domain question answering came onto the stage of history. NIST organized its first competition, requiring participating QA systems to provide an exact answer to each question, either a short answer of no more than 50 bytes or a long answer of no more than 250 bytes. Here are sample questions from the first QA track:

Who was the first American in space?
Where is the Taj Mahal?
In what year did Joe DiMaggio compile his 56-game hitting streak?

3. Short-lived prosperity

What were the results and significance of this first open-domain QA competition? It should be said that the results were impressive, a milestone in QA history. The best systems (including ours) achieved a correct rate of more than 60%; that is, for every three questions, the system could search the given corpus and return two correct answers. This was a very encouraging result for a first attempt at open-domain systems. It was the heyday of the dot-com era, and the IT industry was eager to move this latest research into information products and revolutionize search. There were a lot of interesting stories after that (see my related blog post in Chinese, "The road to entrepreneurship"), eventually leading to the historical AI event of IBM's Watson beating humans in Jeopardy.

The timing and the preparation by the organizers, the search industry and academia all contributed to the QA systems' seemingly miraculous results. NIST emphasized well-formed natural language questions as the appropriate input (i.e., English questions, see above), rather than the traditional short keyword queries.  Such questions tend to be long, well suited to leveraging paragraph search. For the competition's sake, the organizers ensured that each question asked indeed had an answer in the given corpus. As a result, the text archive contained statements similar to the designed questions, increasing the odds of sentence matching in paragraph retrieval (Watson's later practice shows that, from the big data perspective, similar statements containing answers are bound to appear in text as long as a question is naturally long). Imagine if there were only one or two keywords: it would be far more difficult to identify the relevant paragraphs and statements that contain answers. Of course, finding the relevant paragraphs or statements is not sufficient for the task, but it effectively narrows the scope of the search, creating a good condition for pinpointing the short answers required.  At this point, the relatively mature technology of named entity tagging from the information extraction community kicked in.  To achieve objectivity and consistency in administering the QA competition, the organizers deliberately selected questions that are relatively simple and straightforward, asking about names, time or location (so-called factoid questions).  This practice naturally agreed closely with the named entity task, making the first step into open-domain QA a smooth process and returning very encouraging results as well as a shining prospect to the world.
For example, for the question "In what year did Joe DiMaggio compile his 56-game hitting streak?", the paragraph or sentence search could easily find text statements similar to the following: "Joe DiMaggio's 56 game hitting streak was between May 15, 1941 and July 16, 1941."  An NE system tags 1941 as a time expression with no problem, and the asking point for time, decoded in parsing the wh-phrase "in what year", is also not difficult to identify. Thus an exact answer to an exact question seems magically retrieved from a sea of documents, like a needle found in a haystack. Following roughly the same approach, and equipped with gigantic computing power for parallel processing of big data, IBM's Watson beat humans 11 years later in the Jeopardy live show in front of a nationwide TV audience, stimulating the entire nation's imagination and awe for this technological advance.  From a QA research perspective, IBM's victory in the show was in fact an expected natural outcome, more an engineering scale-up showcase than a research breakthrough, as the basic approach of snippet + NE + asking-point had long been proven.
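The pinpointing step in this example (asking point = TIME, so pick out a year expression from the retrieved sentence) reduces, in the simplest possible sketch, to one pattern match. A real NE tagger is of course far richer than a single regex; this toy only handles four-digit years:

```python
import re

def pinpoint_year(sentence: str) -> str:
    """Crude TIME 'named entity' extractor: first 4-digit year in 1800-2099."""
    match = re.search(r"\b(1[89]\d\d|20\d\d)\b", sentence)
    return match.group(1) if match else ""

snippet = ("Joe DiMaggio's 56 game hitting streak was "
           "between May 15, 1941 and July 16, 1941.")
print(pinpoint_year(snippet))  # 1941
```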

A retrospect shows that adequate QA systems for factoid questions invariably combine a solid Named Entity module with a question parser for identifying asking points.  As long as there is IE-indexed big data behind them, with information redundancy in its nature, factoid QA is a very tractable task.

4. State of the art

The year 1999 witnessed the academic community's initial success in the first open-domain QA track, a new frontier of the retrieval world.  We also benefited from that event as a winner, having soon secured a venture capital injection of $10 million from Wall Street. It was an exciting time, shortly after AskJeeves' initial success in presenting a natural language interface online (though they did not have the QA technology for retrieving exact answers automatically from a huge archive; instead they used human editors behind the scenes to update an answers database).  A number of QA start-ups were funded.  We were all expecting to create a new era in the information revolution. Unfortunately, the good times did not last long: the Internet bubble burst, and the IT industry fell into the abyss of depression.  Investors tightened their monetary operations, and the QA heat soon declined to freezing point and almost disappeared from the industry (except in giants' labs such as IBM's Watson; in our case, we shifted from QA to mining online brand intelligence for enterprise clients). No one in the mainstream believed in this technology anymore. Compared with traditional keyword indexing and search, open-domain QA was not as robust and had yet to scale up to really big data to show its power. The focus of the search industry shifted from depth back to breadth, concentrating on indexing coverage, including the so-called deep web. While the development of QA systems was almost extinct in the industry, the emerging field stayed deeply rooted in the academic community and developed into an important branch of natural language research at universities and research labs. IBM later solved the scale-up challenge, as a precursor of the current big data architectural breakthrough.

At the same time, scholars began to classify the various types of questions that challenge QA. A common classification is based on the type of a question's asking point.  Many of us still remember our high school language classes, where the teacher stressed the 6 WHs for reading comprehension: who / what / when / where / how / why (Who did what, when, where, how and why?).  Once the answers to these questions are clear, the central story of an article is in hand. As a simulation of human reading comprehension, a QA system is designed to answer these key WH questions as well. It is worth noting that these WH questions come at different difficulty levels, depending on the type of asking point (one major goal of question parsing is to identify the key need of a question, what we call asking-point identification, usually based on parsing wh-phrases and other question clues). Asking points that correspond to an entity as an appropriate answer, such as who / when / where, make for relatively easy questions (i.e., factoid questions). Other questions are not simply answerable by an entity, such as what-is / how / why, and there is consensus that answering them is a much more challenging task than factoid questions.  A brief introduction to these three types of "tough" questions and their solutions is presented below as a showcase of the on-going state of the art, to conclude this overview of the QA journey.

What/who is X? This type of question is the so-called definition question, such as "What is iPad II?" or "Who is Bill Clinton?" Such a question is typically very short; after the wh-word and the stop word "is" are stripped in question parsing, what is left is just a name or a term as input to the QA system.  Such an input is detrimental to a traditional keyword retrieval system, as it ends up with too many hits, from which the system can only pick the documents with the highest keyword density or page rank as returns.  From a QA perspective, the minimal requirement for answering this question is a definition statement of the form "X is a ...".  Since any entity is in multiple relationships with other entities and involved in various events described in the corpus, a better answer to a definition question is a summary of the entity with links to its key relations and events, giving a profile of the entity.  Such technology is in existence and has in fact been partly deployed today. It is called a knowledge graph, supported by underlying information extraction and fusion. The state-of-the-art solution for this type of question is best illustrated by Google's deployment of its knowledge graph in handling short queries about movie stars and other VIPs.
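The profile-style answer to a definition question can be pictured as fusing extracted relation triples about the entity into one summary. Here is a knowledge-graph-flavored toy sketch; the triples and relation names are hypothetical illustrations, not the output of any real extractor:

```python
from collections import defaultdict

def build_profile(triples):
    """Group (subject, relation, value) triples into a per-entity profile."""
    profile = defaultdict(list)
    for subject, relation, value in triples:
        profile[(subject, relation)].append(value)
    return profile

def answer_who_is(name, triples):
    """Fuse all known facts about `name` into one definition-style answer."""
    profile = build_profile(triples)
    facts = [f"{rel}: {', '.join(vals)}"
             for (subj, rel), vals in profile.items() if subj == name]
    return f"{name} -- " + "; ".join(facts)

# Hypothetical extraction output:
triples = [
    ("Bill Clinton", "role", "42nd President of the United States"),
    ("Bill Clinton", "spouse", "Hillary Clinton"),
    ("Bill Clinton", "born", "1946"),
]
print(answer_who_is("Bill Clinton", triples))
```

The fusion step, collecting facts scattered across many documents under one entity key, is what distinguishes a profile answer from the single definition sentence that keyword retrieval might surface.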

The next challenge is the how-question, asking for a way of solving a problem or doing something, e.g. How can we increase bone density? How to treat a heart attack? This type of question calls for a summary of all types of solutions: medicines, experts, procedures, or recipes. A single phrase is usually not a good answer, and is bound to miss the variety of possible solutions needed by the users (often product designers, scientists or patent lawyers), who are typically doing prior-art research and literature review for a solution they have in mind. We developed a powerful system, based on deep parsing and information extraction, to answer open-domain how-questions comprehensively in the product Illumin8, deployed by Elsevier for quite a few years. (Powerful as it was, unfortunately, it did not end up a commercial success from a revenue perspective.)
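
Schematically (the solution categories and mentions below are invented for illustration; the real Illumin8 pipeline ran deep parsing and IE over the literature), answering a how-question amounts to collecting candidate solutions and grouping them by type rather than returning a single phrase:

```python
# Hypothetical solution mentions extracted from a corpus for the question
# "How can we increase bone density?"
candidates = [
    ("medicine", "calcium supplements"),
    ("procedure", "weight-bearing exercise"),
    ("medicine", "vitamin D"),
    ("expert", "endocrinologist"),
    ("medicine", "calcium supplements"),
]

def summarize_solutions(candidates):
    """Group extracted solution mentions by solution type, deduplicating mentions."""
    summary = {}
    for sol_type, mention in candidates:
        summary.setdefault(sol_type, set()).add(mention)
    return summary

print(summarize_solutions(candidates))
```

The returned summary covers all solution types found in the corpus, which is the kind of comprehensive answer a prior-art researcher needs.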

The third difficult type is the why-question. People ask why-questions to find the cause or motive behind a phenomenon, whether an event or an opinion. For example: why do people like or dislike our product Xyz? There may be thousands of different reasons behind a sentiment or opinion. Some reasons are expressed explicitly (I love the new iPhone 7 because of its greatly enhanced camera) and many more appear in implicit expressions (just replaced my iPhone, it sucks in battery life). An adequate QA system should be able to mine the corpus and then summarize and rank the key reasons for the user. In the last 5 years, we have developed a customer insight product that answers the why-questions behind public opinions and sentiments on any topic by mining the entire social media space.
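
A minimal sketch of the idea, with made-up mentions and a crude surface pattern (a real system mines implicit reasons through deep parsing, not a regex): collect explicit "because of" reasons behind sentiments and rank them by frequency:

```python
import re
from collections import Counter

# Hypothetical social-media mentions of a product (toy data)
mentions = [
    "I love the new iPhone 7 because of its camera",
    "Bought an iPhone 7 because of the camera",
    "I am returning my iPhone 7 because of its battery life",
]

def mine_reasons(texts):
    """Collect explicit reasons introduced by 'because of' and rank by frequency."""
    reasons = Counter()
    for text in texts:
        m = re.search(r"because of (?:its |the )?(.+)", text, re.IGNORECASE)
        if m:
            reasons[m.group(1).strip().lower()] += 1
    return reasons.most_common()

print(mine_reasons(mentions))
# [('camera', 2), ('battery life', 1)]
```

The ranking step matters: with thousands of distinct reasons in a corpus, only aggregation turns the extracted fragments into an answer.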

Since I came to Silicon Valley 9 years ago, I have been lucky, and proud, to have had the chance to design and develop QA systems that answer these widely acknowledged challenging questions. Two products, answering open-domain how-questions and why-questions on top of deep sentiment analysis, have been developed and deployed to global customers. Our deep parsing and IE platform is also equipped to construct a deep knowledge graph to help answer definition questions, but unlike Google, with its huge platform for search needs, we have not yet identified a commercial opportunity to deploy that capability.

This piece of writing first appeared in 2011 on my personal blog and has seen only limited revisions since. Thanks to Google Translate at https://translate.google.com/ for providing a quick first draft, which I post-edited.

 

[Related]

http://en.wikipedia.org/wiki/Question_answering

The Anti-Eliza Effect, New Concept in AI

"Knowledge map and open-domain QA (1)" (in Chinese)

"knowledge map and how-question QA (2)"  (in Chinese)

Ask Jeeves and its million-dollar idea for human interface (in Chinese)

Dr Li’s NLP Blog in English

 

Newest GNMT: time to witness the miracle of Google Translate

gnmt

Wei:
Recently, the WeChat community has been full of hot discussion and testing of Google's newest announcement of a breakthrough in its NMT (neural network-based machine translation) offering, claimed to have achieved significant progress in translation quality and readability. It sounds like a major breakthrough worthy of attention and celebration.

The report says:

Ten years ago, we released Google Translate, the core algorithm behind this service is PBMT: Phrase-Based Machine Translation.  Since then, the rapid development of machine intelligence has given us a great boost in speech recognition and image recognition, but improving machine translation is still a difficult task.

Today, we announced the release of the Google Neural Machine Translation (GNMT) system, which utilizes state-of-the-art training techniques to achieve the largest improvements to date in machine translation quality. For a full review of our findings, please see our paper "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation."

A few years ago, we began using RNNs (Recurrent Neural Networks) to directly learn the mapping of an input sequence (such as a sentence in one language) to an output sequence (the same sentence in another language). Whereas phrase-based machine translation (PBMT) breaks an input sentence into words and phrases to be translated largely independently, NMT treats the entire input sentence as the basic unit of translation.

The advantage of this approach is that it requires less engineering design than the previous phrase-based translation systems. When first proposed, NMT achieved accuracy comparable to phrase-based systems on medium-sized public benchmark data sets. Since then, researchers have proposed a number of techniques to improve NMT, including modeling an external alignment model to handle rare words, using attention to align input and output words, and breaking words into smaller units to cope with rare words. Despite these advances, the speed and accuracy of NMT had not been able to meet the requirements of a production system such as Google Translate. Our new paper describes how we overcame the many challenges of making NMT work on very large data sets, and how we built a system that is both fast and accurate enough to deliver a better translation experience for Google's users and services.

............

Using side-by-side comparisons by human assessors as the standard, the GNMT system translates significantly better than the previous phrase-based production system. With the help of bilingual human assessors, we found on sample sentences from Wikipedia and news websites that GNMT reduced translation errors by 55% to 85% or more for several major language pairs.

In addition to publishing this research paper today, we have also announced that GNMT will be put into production in a very difficult language pair (Chinese-English) translation.

Now, the Chinese-to-English translations of Google Translate for mobile and web are produced 100% by the GNMT system: about 18 million translations per day. GNMT's production deployment uses our open-source machine learning toolkit TensorFlow and our Tensor Processing Units (TPUs), which provide sufficient computational power to deploy these powerful GNMT models while meeting Google Translate's strict latency requirements.

Chinese-to-English translation is one of the more than 10,000 language pairs supported by Google Translate. In the coming months, we will continue to extend our GNMT to far more language pairs.

Google Translate's GNMT has achieved a major breakthrough!

As an old machine translation researcher, I cannot resist this temptation. I cannot wait to try this latest version of Google Translate for Chinese-English.
Previously I had tried Google's Chinese-to-English online translation multiple times; the overall quality was not very readable, and certainly not as good as its competitor Baidu. With this newest breakthrough using deep learning with neural networks, it is believed to come close to human translation quality. I have a few hundred Chinese blog posts on NLP waiting to be translated as a trial. I was looking forward to this first attempt at using Google Translate on my science popularization blog titled Introduction to NLP Architecture. My adventure is about to start. Now is the time to witness the miracle, if a miracle does exist.

Dong:
I hope you will not be disappointed. I have jokingly said before: rule-based machine translation is a fool, statistical machine translation is a madman, and now I continue the ridicule: neural machine translation is a "liar" (I am not referring to the developers behind NMT). Language is not a cat face or the like; surface fluency alone does not work, the content must be faithful to the original!

Wei:
Let us experience the magic, please listen to this translated piece of my blog:

This is my Introduction to NLP Architecture fully automatically translated by Google Translate yesterday (10/2/2016) and fully automatically read out without any human interference.  I have to say, this is way beyond my initial expectation and belief.

Listen to it for yourself, the automatic speech generation of this science blog of mine is amazingly clear and understandable. If you are an NLP student, you can take it as a lecture note from a seasoned NLP practitioner (definitely clearer than if I were giving this lecture myself, with my strong accent). The original blog was in Chinese and I used the newest Google Translate claimed to be based on deep learning using sentence-based translation as well as character-based techniques.

Prof. Dong, you know my background and my originally doubtful mindset. However, in the face of such progress, far beyond the limits of what we imagined for automatic translation, in both quality and robustness, when I started my NLP career in MT training 30 years ago, I have to say that it is a dream come true in every sense.

Dong:
In their terminology, it is "less adequate, but more fluent." Machine translation has gone through three paradigm shifts. When people find that it can only be a good information-processing tool and cannot really replace human translation, they will choose the less costly option.

Wei:
In any case, this small test is revealing to me. I am still feeling overwhelmed to see such a miracle live. Of course, what I have just tested is the formal style, on a computer and NLP topic, it certainly hit its sweet spot with adequate training corpus coverage. But compared with the pre-NN time when I used both Google SMT and Baidu SMT to help with my translation, this breakthrough is amazing. As a senior old school practitioner of rule-based systems, I would like to pay deep tribute to our "nerve-network" colleagues. These are a group of extremely genius crazy guys. I would like to quote Jobs' famous quotation here:

“Here's to the crazy ones. The misfits. The rebels. The troublemakers. The round pegs in the square holes. The ones who see things differently. They're not fond of rules. And they have no respect for the status quo. You can quote them, disagree with them, glorify or vilify them. About the only thing you can't do is ignore them. Because they change things. They push the human race forward. And while some may see them as the crazy ones, we see genius. Because the people who are crazy enough to think they can change the world, are the ones who do.”

@Mao, this counts as my most recent feedback to the Google scientists and their work. Last time, a couple of months ago when they released their parser, proudly claimed to be "the most accurate parser in the world", I wrote a blog to ridicule them after performing a serious, apples-to-apples comparison with our own parser. This time, they used the same underlying technology to announce this new MT breakthrough with similar pride, and I am happily expressing my deep admiration for their wonderful work. This contrast in my attitudes looks a bit weird, but it is all based on the facts of life. In the case of parsing, this school suffers from a lack of naturally labeled data with which to perfect quality, especially when porting to new domains or genres beyond news corpora. After all, what exists in the language sea is raw text, linear strings of words, while the corresponding parse trees are only occasional, artificial objects made by linguists in a limited scope by nature (e.g. the Penn Treebank, or other news-genre parse trees by the Google annotation team). But MT is different: it is a unique NLP area with an almost endless supply of high-quality, naturally occurring "labeled" data in the form of human translation, which has never stopped being produced.

Mao: @wei That is to say, you now embrace or endorse a neuron-based MT, a change from your previous views?

Wei:
Yes I do embrace and endorse the practice. But I have not really changed my general view wrt the pros and cons between the two schools in AI and NLP. They are complementary and, in the long run, some way of combining the two will promise a world better than either one alone.

Mao: What is your real point?

Wei:
Despite the biases we are all born with by human nature, conditioned by what we have done and where we come from in terms of technical background, we all need to observe and respect the basic facts. Just listen to the audio of their GNMT translation by clicking the link above: the fluency, and even faithfulness to my original text, has in fact outperformed an ordinary human translator, in my best judgment. If I gave this lecture in a classroom and asked an average interpreter without sufficient knowledge of my domain to translate on the spot, I bet he would have a hard time performing better than the Google machine above (human translation gurus are, of course, an exception). This miracle-like fact has to be observed and acknowledged. On the other hand, as I said before, no matter how deep the learning reaches, I still do not see how they can catch up with the quality of my deep parsing in the next few years when they have no way of magically obtaining the huge labeled tree data they depend on, especially across a variety of domains and genres. They simply cannot "make bricks without straw" (as an old Chinese saying goes, even the most capable housewife can hardly cook a good meal without rice). In the natural world, there are no syntactic trees and structures to learn from, only linear sentences. The deep learning breakthroughs seen so far are still mainly supervised learning, which has an almost insatiable appetite for massive labeled data, forming its limiting knowledge bottleneck.

Mao: I'm confused. Which one do you believe stronger? Who is the world's No. 0?

Wei:
Parsing-wise, I am happy to stay as No. 0 if Google insists on being No. 1 in the world. As for MT, it is hard to say, from what I see, between their breakthrough and some highly sophisticated rule-based MT systems out there. What I can say is that, at a high level, the trend of mainstream statistical MT winning ground in both industry and academia over the old-school rule-based MT is more evident today than before. This is not to say that rule-based MT is no longer viable, or coming to an end. There are things in which SMT cannot beat rule-based MT. For example, certain types of seemingly stupid mistakes made by GNMT (quite a few laughable examples of totally wrong or opposite translations have been shared in this salon in the last few days) are almost never seen in rule-based MT systems.

Dong:
here is my try of GNMT from Chinese to English:

学习上,初二是一个分水岭,学科数量明显增多,学习方法也有所改变,一些学生能及时调整适应变化,进步很快,由成绩中等上升为优秀。但也有一部分学生存在畏难情绪,将心思用在学习之外,成绩迅速下降,对学习失去兴趣,自暴自弃,从此一蹶不振,这样的同学到了初三往往很难有所突破,中考的失利难以避免。

Learning, the second of a watershed, the number of subjects significantly significantly, learning methods have also changed, some students can adjust to adapt to changes in progress, progress quickly, from the middle to rise to outstanding. But there are some students there is Fear of hard feelings, the mind used in the study, the rapid decline in performance, loss of interest in learning, self-abandonment, since the devastated, so the students often difficult to break through the third day,

Mao: This translation cannot be said to be good at all.

Wei:
Right, that is why it calls for an objective comparison to answer your previous question. Currently, as I see it, the data for social media and casual text are certainly not enough, hence the translation quality of online messages is still not their forte. As for the textual sample Prof. Dong showed us above, Mao said the Google translation is not of good quality, as expected. But even so, I still see impressive progress. Before the deep learning era, SMT results from Chinese to English were hardly readable; now they can generally be read aloud and roughly understood. There is a lot of progress worth noting here.

Ma:
In fields with big data, DL methods have advanced by leaps and bounds in recent years. I know a number of experts who used to be biased against DL but changed their views upon seeing the results. However, DL has so far remained basically ineffective in the IR field, though there are signs of it slowly penetrating IR.

Dong:
The key to NMT is "looking nice." So to people who do not understand the original source text, it sounds like a smooth translation. But isn't it a "liar" if a translation loses its faithfulness to the original? This is the Achilles' heel of NMT.

Ma: @Dong, I think all statistical methods share this weak point.

Wei:
Indeed, each has its pros and cons. Today I have listened to the Google translation of my blog three times and am still amazed at what they have achieved. There are always some mistakes to pick at here and there, but to err is human, not to say a machine, right? Not to mention that the community will keep advancing and correcting mistakes. From the intelligibility and fluency perspectives, I have been served super satisfactorily today. And this occurs between two languages with no historical kinship whatsoever.

Dong:
Some leading managers said to me years ago: "In fact, even if machine translation is only 50 percent correct, it does not matter. The problem is that it cannot tell me which half it cannot translate well. If it could, I could always save half the labor and hire a human translator for only the other half." I replied that I was not able to make a system do that. I have been concerned about this issue ever since, until today, when there is a lot of noise about MT replacing human translation any time now. It is kind of like saying that once you have McDonald's, you no longer need a fine restaurant for French delicacies. Not to mention that machine translation today still cannot be compared even to McDonald's. Computers, with machine translation and the like, are in essence a toy God gave us humans to play with. God never agreed to equip us with the ability to copy ourselves.

Why did GNMT first choose language pairs like Chinese-to-English, and not the other way round, to showcase? This is very shrewd of them. Even if the translation is wrong or misses the point, the output of this new model is usually at least fluent, unlike the traditional models, which look and sound broken, silly and erroneous. This is a characteristic of NMT: it selects the closest match from the translation corpus. As the vast majority of English readers do not understand Chinese, it is easy to impress them with how great the new MT is, even for a difficult language pair.

Wei:
Correct. A closer look reveals that this "breakthrough" lies more in the fluency of the target language than in faithfulness to the source language, achieving readability at the cost of accuracy. But this is just the beginning of a major shift. I can fully understand the GNMT people's joy and pride in front of a breakthrough like this. In our careers, we do not often get that kind of moment for celebration.

Deep parsing is NLP's crown jewel. It remains to be seen how they can beat us in handling domains and genres that lack labeled data. I wish them good luck; the day they prove they make better parsers than mine will be the day of my retirement. That day does not look anywhere near, to my mind. I wish I were wrong, so I could travel the world worry-free, knowing that my dream had been better realized by my colleagues.

Thanks to Google Translate at https://translate.google.com/ for helping to translate this Chinese blog into English; I post-edited the result.

 

[Related]

Wei’s Introduction to NLP Architecture Translated by Google

"OVERVIEW OF NATURAL LANGUAGE PROCESSING"

"NLP White Paper: Overview of Our NLP Core Engine"

Introduction to NLP Architecture

It is untrue that Google SyntaxNet is the "world’s most accurate parser"

Announcing SyntaxNet: The World’s Most Accurate Parser Goes Open

Is Google SyntaxNet Really the World’s Most Accurate Parser?

Dr Li's NLP Blog in English

A Brief Introduction to NLP System Architecture

For natural language processing (NLP) and its applications, system architecture is the core issue. In my blog post "OVERVIEW OF NATURAL LANGUAGE PROCESSING", I gave four architecture diagrams for NLP systems; here is a brief explanation of them, one by one.

I divide an NLP system, from the core engine up to the applications, into four stages, corresponding to the four diagrams. At the bottom, and most central, is deep parsing: the automatic analyzer that works on natural language bottom-up, layer by layer. This is the most difficult part of the work, but it is the enabling technology underlying the vast majority of NLP systems.

The purpose of parsing is to structure unstructured language. Facing infinitely varied language expressions, only after structuring can patterns be captured easily, information be extracted, and semantics be resolved. This insight became the consensus of (computational) linguistics as early as Chomsky's linguistic revolution of 1957, with its transformation from surface structure to deep structure. A parse tree consists not only of the arcs expressing syntactic relations, but also of the leaves (nodes) of words or phrases loaded with various kinds of information. Important as the tree is, it generally cannot support a product directly; it is only the system's internal representation, a carrier of language analysis and understanding, and the core support for grounding semantics into applications.

The next layer is extraction, as shown in the diagram above. Its input is the parse tree, and its output is filled-in templates, much like filling in a form: for the intelligence an application needs, a table is defined in advance, and the extraction system fills in the blanks, pulling the relevant words or phrases from the sentences into the pre-defined fields. This layer has moved from the domain-independent parser into tasks facing a domain and targeting the needs of applications and products.

It is worth emphasizing that the extraction layer is domain-oriented and semantically focused, while the analysis layer before it is domain-independent. A good architecture therefore makes the analysis deep and logical, so as to lighten the burden of extraction. When extraction operates on the logical-semantic structure of a deep parse, one extraction rule is equivalent to hundreds or thousands of rules at the linear surface of language. This creates the conditions for porting to new domains.

There are two major types of extraction. One is traditional information extraction (IE), which extracts facts, i.e. objective intelligence: entities, relationships between entities, and events involving various entities, answering questions like who did what when and where. Extraction of this objective intelligence is the core technology and foundation of today's red-hot knowledge graph. After IE, adding the integration step from the mining layer below (IF: information fusion) builds the knowledge graph. The other type of extraction targets subjective intelligence; public opinion mining is based on this kind of extraction. This has also been my focus over the past five years: fine-grained sentiment extraction (not just classifying praise and blame, but mining the reasons behind the sentiment to provide a basis for decision-making). It is one of the hardest tasks in NLP, much harder than IE of objective intelligence. The extracted information usually goes into some database, which supplies fragments of intelligence to the mining layer below.

Many people confuse extraction (information extraction) with the next step, mining (text mining), but these are really tasks at two different levels. Extraction faces individual language trees, looking for the desired intelligence sentence by sentence. Mining faces a corpus, or a data source as a whole, digging statistically significant intelligence out of the language forest. In the information age, our biggest challenge is information overload; since we have no way to exhaust the ocean of information, we must rely on computers to mine from it the key intelligence that serves different applications. Mining therefore naturally depends on statistics: without statistics, the extracted information remains a disorganized heap of fragments with much redundancy, which mining can integrate.

Many systems do no deep mining; they simply take a query expressing the information need as the entry point, retrieve in real time from the database of extracted, fragmented information, merge the top n results, and hand them to the product and its users. This, too, is mining, but simple mining implemented by retrieval, directly supporting the application.

In fact, to do mining well, there is much more work to be done here. Not only can mining integrate and improve the quality of existing intelligence; done deeply, it can also dig out hidden intelligence, that is, intelligence not explicitly expressed in the metadata, such as causal relations between pieces of intelligence or other statistical trends. Such mining was first done in traditional data mining, because traditional mining targeted structured data such as transaction records, making it easy to discover implicit associations (e.g., buyers of diapers often also buy beer, which turned out to be the habitual behavior of new fathers; mining such intelligence can help optimize product placement and sales). Now that natural language is also structured into extracted fragments of intelligence in a database, the same mining of implicit associations can of course be performed to raise the value of the intelligence.

The fourth architecture diagram is the NLP application (apps) layer. In this layer, the intelligence produced by analysis, extraction, and mining can support various NLP products and services, from question answering systems to dynamic browsing of knowledge graphs (searching for a celebrity in Google search already shows this application), from automatic polling to customer intelligence, from intelligent assistants to automatic summarization, and so on.

This is my overall account of the basic NLP architecture, based on nearly 20 years of experience building NLP products in industry. Eighteen years ago, it was with an NLP architecture diagram like this that I won our first venture investment; the investor himself told us it was a million dollar slide. Today's account is an extension of that very diagram.

As the heavens do not change, neither does the Way.

I have told the story of this million-dollar slide before. Around 2000, during the Clinton administration, the United States went through a great leap forward of Internet technology, known as the dot-com bubble: hot money poured in, and Internet startups sprang up like mushrooms after rain. In that climate, my boss decided to strike while the iron was hot and seek venture capital, and asked me to prepare an introduction to the language system prototype we had implemented. I drew the three-layer NLP architecture diagram below: at the bottom is the parser, from shallow to deep; in the middle, information extraction built on parsing; at the top, the major categories of applications, including question answering. Connecting the applications with the two language-processing layers below is a database that stores the results of information extraction, ready to supply intelligence to the applications at any time. This architecture has not changed much since I proposed it 15 years ago, although its details and illustrations have been redrawn no fewer than 100 times; the diagram in this post is roughly one of the first 20 versions, covering only the core engine (the backend), not the applications (the frontend). The architecture slide was emailed by my boss to a Wall Street angel investor early in the morning; by noon we had his reply expressing strong interest. Within two weeks, we received our first angel investment check of one million dollars. The investor said the diagram was brilliant, "this is a million dollar slide": it showed both the technical barrier to entry and the technology's enormous potential.


from ScienceNet: Pre-Knowledge Graph: The Architecture of the Information Extraction Engine

[Related]

Introduction to NLP Architecture

NLP Overview (in Chinese)

Pre-Knowledge Graph: The Architecture of the Information Extraction Engine (in Chinese)

Natural language parsers are LIGO-style detectors revealing the mysteries of language (in Chinese)

Dream Come True (in Chinese)

"OVERVIEW OF NATURAL LANGUAGE PROCESSING"

"NLP White Paper: Overview of Our NLP Core Engine"

White Paper of NLP Engine

Pinned: Wei Li's NLP Blog Posts (in Chinese)

Introduction to NLP Architecture

(translated by Google Translate, post-edited by myself)

For natural language processing (NLP) and its applications, system architecture is the core issue. In my blog post OVERVIEW OF NATURAL LANGUAGE PROCESSING, I sketched four NLP system architecture diagrams, now to be presented one by one.

In my design philosophy, an NLP process is divided into four stages, from the core engine up to the applications, as reflected in the four diagrams. At the bottom is deep parsing, the bottom-up, layer-by-layer automatic analysis of sentences. This work is the most difficult, but it is the foundation and enabling technology for the vast majority of NLP systems.

(diagram 1: the deep parsing layer)

The purpose of parsing is to structure unstructured text. Facing ever-changing language, only when it is structured in some logical form can we formulate patterns to capture the information we want to extract in support of applications. This principle of linguistic structure began to be the consensus of the linguistics community when Chomsky proposed the transformation from surface structure to deep structure in his linguistic revolution of 1957. A tree representing the logical form involves not only arcs that express syntactic-semantic relationships, but also nodes of words or phrases that carry various kinds of conceptual information. Despite the importance of such deep trees, they generally do not directly support an NLP product. They remain the internal representation of the parsing system, a result of language analysis and understanding, before their semantic grounding to applications as their core support.

(diagram 2: the extraction layer)

The next layer after parsing is the extraction layer, as shown in the diagram above. Its input is the parse tree, and its output is filled-in templates, much like filling in a form: for the information the application needs, a table is pre-defined so that the extraction system can fill in the blanks with the relevant words or phrases extracted from the text based on parsing. This layer moves from the original domain-independent parser into application-oriented, product-driven tasks.
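
To make the form-filling metaphor concrete, here is a minimal sketch with a mock parse and hypothetical field names (not our actual engine):

```python
# A mock dependency parse: (head, relation, dependent) arcs for the sentence
# "Google acquired DeepMind in 2014."
parse = [
    ("acquired", "subject", "Google"),
    ("acquired", "object", "DeepMind"),
    ("acquired", "time", "2014"),
]

# Pre-defined template for an acquisition event: the fields to fill in
TEMPLATE = {"acquirer": None, "acquired": None, "date": None}

# Application-specific mapping from parse relations to template fields
FIELD_MAP = {"subject": "acquirer", "object": "acquired", "time": "date"}

def fill_template(parse, trigger="acquired"):
    """Fill the acquisition template from arcs headed by the trigger word."""
    record = dict(TEMPLATE)
    for head, rel, dep in parse:
        if head == trigger and rel in FIELD_MAP:
            record[FIELD_MAP[rel]] = dep
    return record

print(fill_template(parse))
# {'acquirer': 'Google', 'acquired': 'DeepMind', 'date': '2014'}
```

Because the rule operates on parse arcs rather than surface word order, one such mapping covers many surface variants of the same event.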

It is worth emphasizing that the extraction layer is domain-oriented and semantically focused, while the preceding parsing layer is domain-independent. Therefore, a good framework does a very thorough analysis of logical semantics in deep parsing, in order to reduce the burden on information extraction. With extraction supported by deep logical-semantic structures, one rule at the extraction layer is in essence equivalent to thousands of surface rules at the linear text layer. This creates the conditions for efficient porting to new domains on top of the same core parsing engine.

There are two types of extraction. One is traditional information extraction (IE), the extraction of facts or objective information: named entities, relationships between entities, and events involving entities (which can answer questions like "who did what when and where"). This extraction of objective information is the core technology and foundation of the knowledge graph (nowadays such a hot area in industry). After IE, the next step of information fusion (IF) is aimed at constructing the knowledge graph. The other type of extraction is about subjective information; public opinion mining, for example, is based on this kind of extraction. What I have focused on over the past five years is along this line: fine-grained extraction of public opinions (not just sentiment classification, but also exploring the reasons behind the opinions and sentiments to provide an insight basis for decision-making). This is one of the hardest tasks in NLP, much more difficult than IE of objective information. The extracted information is usually stored in a database, providing masses of textual mentions of information to feed the mining layer.

Many people confuse information extraction with text mining, but in fact they are tasks at two different levels. Extraction faces each individual language tree, embodied in each sentence, to find the information we want. Mining faces a corpus, or a data source as a whole, gathering statistically significant insights from the language forest. In the information age, the biggest challenge we face is information overload; we have no way to exhaust the information ocean for the insights we need, so we must use computers to dig the critical intelligence out of the ocean for different applications. Mining therefore naturally relies on statistics: without statistics, the information remains scattered across the corpus even after it is identified. There is a lot of redundancy in the extracted mentions, and mining can integrate them into valuable insights.
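
The division of labor can be sketched in a few lines (toy data): extraction produces one record per sentence mention, while mining aggregates the redundant records across the corpus into statistics:

```python
from collections import Counter

# Extraction output: one (topic, sentiment) record per sentence mention
extracted = [
    ("battery", "negative"),
    ("camera", "positive"),
    ("battery", "negative"),
    ("camera", "positive"),
    ("battery", "positive"),
]

def mine(records):
    """Aggregate per-sentence extraction records into corpus-level statistics."""
    stats = Counter(records)
    return stats.most_common()

print(mine(extracted))
```

Each input record is a fragment found in one sentence; only the aggregation step turns the redundant fragments into an insight (here, that battery complaints dominate).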

(diagram 3: the mining layer)

Many NLP systems do not perform deep mining; instead, they simply use a query to search the index of extracted information in real time, merge the retrieved pieces on the fly, and present the top n results to the user. This is actually also mining, but implemented by retrieval: a simple form of mining that directly supports an application.

There is much more that can be done in the mining layer. Text mining not only improves the quality of the existing extracted information; done deeply, it can also tap hidden intelligence, that is, intelligence not explicitly expressed in the data sources, such as causal relationships between events, or statistical trends of public opinions or behaviors. This type of mining was first done in traditional data mining applications, because traditional mining was aimed at structured data such as transaction records, making it easy to mine implicit associations (e.g., people who buy diapers often also buy beer; this reflects the common behavior of new fathers, and such hidden associations, once mined, can be used to optimize the placement and sales of goods). Nowadays, natural language is also structured, thanks to deep parsing, so data mining algorithms for hidden intelligence can in principle be applied to enhance the value of the extracted intelligence.
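
The diapers-and-beer example is classic association mining; a toy sketch over mock transaction records (the data and support threshold are purely illustrative):

```python
from itertools import combinations
from collections import Counter

# Mock transaction records (structured data, as in classical data mining)
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "beer", "bread"},
    {"milk", "bread"},
]

def associations(transactions, min_support=0.5):
    """Find item pairs whose co-occurrence rate meets the support threshold."""
    pair_counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(basket), 2):
            pair_counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in pair_counts.items() if c / n >= min_support}

print(associations(transactions))
# {('beer', 'diapers'): 0.75}
```

Once text has been structured by parsing and extraction, the same kind of co-occurrence statistics can be run over extracted events and opinions instead of shopping baskets.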

The fourth architectural diagram is the NLP application layer. In this layer, the results from parsing, extraction, and mining over unstructured text sources can support a variety of NLP products and services, ranging from QA (question answering) systems to dynamic construction of the knowledge graph (visualized now in Google search when we search for a star or VIP), from automatic polling of public opinions to customer intelligence about brands, from intelligent assistants (e.g. chatbots, Siri, etc.) to automatic summarization, and so on.

(diagram 4: the application layer)

This is my overall presentation of the basic architecture of NLP and its applications, based on nearly 20 years of industry experience designing and developing NLP products. About 18 years ago, I presented a similar diagram of the NLP architecture to our first venture investor, who told us that it was a million dollar slide. The presentation here is a natural inheritance and extension of that diagram.

~~~~~~~~~~~~~~~~~~~
Here is the previously mentioned million-dollar-slide story. Under the Clinton administration, before the turn of the century, the United States went through a "great leap forward" of Internet technology, known as the dot-com bubble, a time of hot money pouring into the IT industry while all kinds of Internet startups were springing up. In that situation, my boss decided to seek venture capital for business expansion and asked me to prepare an introduction to the prototype natural language system we had implemented. I then drew the following three-tier NLP architecture diagram: the bottom layer is parsing, from shallow to deep; the middle layer is information extraction built on parsing; and the top layer illustrates the major categories of NLP applications, including QA. Connecting the applications with the two language-processing layers below is the database, which stores the results of information extraction, ready at any time to support the applications upstairs. This general architecture has not changed much since I drew it years ago, although its details and layout have been redrawn no fewer than 100 times. The diagram below is roughly one of the first 20 editions, covering mainly the backend core engine of information extraction, not so much the front-end interface between applications and the database. I still remember that early one morning my boss sent the slide to a Wall Street angel investor; by noon we had his reply, saying he was very interested. In less than two weeks, we received the first angel investment check of one million dollars. The investor labeled it a million dollar slide, as it showed both the depth of the language technology and its great potential for practical applications.

165325a3pamcdcdr3daapw

Pre-Knowledge Graph: Architecture of Information Extraction Engine

 

【Related Chinese Blogs】

NLP Overview

Pre-Knowledge Graph: The Architecture of Information Extraction Engine

A natural language parser reveals the mysteries of language like a LIGO-type detector

Dream come true

( translated from http://blog.sciencenet.cn/blog-362400-981742.html )

The speech rendition of my fully machine-translated, unedited science blog is attached below (for your entertainment :=); it is amazingly clear and understandable (definitely clearer than if I were giving this lecture myself, with my strong accent).  If you are an NLP student, you can listen to it as a lecture from a seasoned NLP practitioner.

Thanks to the newest Google Translate service from Chinese into English at https://translate.google.com/ 

 

 

[Related]

Wei’s Introduction to NLP Architecture Translated by Google

"OVERVIEW OF NATURAL LANGUAGE PROCESSING"

"NLP White Paper: Overview of Our NLP Core Engine"

【Wei's Science Blog: Google NMT, the Moment to Witness a Miracle】

WeChat has lately been abuzz with a new advance in artificial intelligence: Google Translate has achieved a major breakthrough! It deserves attention and congratulations. Under the new technology, MT's nearly unlimited supply of naturally labeled data seems to be starting to pay off. The report says:

Ten years ago, we released Google Translate. The core algorithm behind the service is Phrase-Based Machine Translation (PBMT).

Since then, rapid advances in machine intelligence have brought huge improvements to our speech recognition and image recognition capabilities, but improving machine translation has remained a hard goal.

Today, we announce the Google Neural Machine Translation (GNMT) system, which uses state-of-the-art training techniques to achieve the largest improvement in machine translation quality to date. For the full details of our research, see our paper, "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation."

A few years ago, we began using Recurrent Neural Networks (RNNs) to directly learn the mapping from an input sequence (such as a sentence in one language) to an output sequence (the same sentence in another language). Whereas phrase-based machine translation (PBMT) breaks an input sentence into words and phrases and translates them largely independently, Neural Machine Translation (NMT) treats the entire input sentence as the basic unit of translation.
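The contrast described above can be made concrete with a toy decoder. The sketch below shows only the PBMT half: greedy longest-match translation against a made-up phrase table, each phrase rendered independently of its neighbors. NMT, by contrast, conditions every output word on the entire input sentence, which is exactly what a lookup table cannot express. All names and table entries are invented for illustration.

```python
# A toy phrase-based decoder: greedy longest-match against a phrase table,
# translating each matched phrase independently of the rest of the sentence.
# The table below is invented for illustration only.
phrase_table = {"白": "white", "菜": "vegetable", "白菜": "cabbage"}

def pbmt_greedy(source):
    out, i = [], 0
    while i < len(source):
        # prefer the longest phrase starting at position i
        for j in range(len(source), i, -1):
            if source[i:j] in phrase_table:
                out.append(phrase_table[source[i:j]])
                i = j
                break
        else:  # out-of-vocabulary character: pass it through unchanged
            out.append(source[i])
            i += 1
    return " ".join(out)

print(pbmt_greedy("白菜"))   # the two characters form one phrase: 'cabbage'
print(pbmt_greedy("白X菜"))  # broken apart: 'white X vegetable'
```

A real PBMT system also has reordering models and a language model on top of the phrase table; the toy shows only the decomposition step that NMT does away with.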

The advantage of this approach is that it requires much less engineering design than the earlier phrase-based translation systems. When first proposed, NMT already achieved accuracy comparable to phrase-based systems on medium-sized public benchmark datasets.

Since then, researchers have proposed many techniques for improving NMT, including handling rare words by mimicking an external alignment model, using attention to align input words with output words, and breaking words into smaller units to cope with rare words. Despite these advances, NMT was not fast or accurate enough to serve as a production system like Google Translate.
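Of the improvements listed above, attention is the easiest to sketch: at each decoding step, the decoder scores every encoder position against its current state and takes a softmax-weighted average as the context for the next output word. The two-dimensional vectors below are made-up toy values, not real embeddings.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(query, keys, values):
    """Dot-product attention: score each encoder state against the query,
    then return the weighted average of the value vectors."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    context = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
    return context, weights

# Three toy encoder states; keys double as values, as in basic attention.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
context, weights = attend([1.0, 0.0], encoder_states, encoder_states)
print([round(w, 2) for w in weights])  # highest weight on the first state
```

The weights act as a soft alignment between the output word being generated and the input positions, replacing the hard word alignments of phrase-based systems.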

Our new paper describes how we overcame the many challenges of making NMT work on very large datasets, and how we built a system fast and accurate enough to provide a better translation experience for Google's users and services.

Data from side-by-side evaluations, in which human raters compare the quality of translations of a given source sentence. Scores range from 0 to 6, with 0 meaning "completely nonsensical translation" and 6 meaning "perfect translation."

............

Using side-by-side human evaluation as a standard, the translations produced by the GNMT system are a dramatic improvement over the previous phrase-based production system.

With the help of bilingual human raters, we measured on sample sentences from Wikipedia and news websites that GNMT reduces translation errors by 55%-85% or more on several major language pairs.

In addition to publishing this research paper today, we are announcing that GNMT is now in production for the translation of a notoriously difficult language pair: Chinese to English.

Chinese-to-English translation in the mobile and web versions of Google Translate now uses GNMT 100% of the time, at roughly 18 million translations per day. The production deployment of GNMT relies on our publicly released machine learning toolkit TensorFlow and on our Tensor Processing Units (TPUs), which provide enough computational power to deploy these powerful GNMT models while meeting the strict latency requirements of the Google Translate product.
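The "strict latency requirements" mentioned here are a real engineering tension: accelerators like TPUs want large batches, while interactive users want instant answers. A standard compromise (this is a generic technique, only a guess at what any production stack actually does) is deadline-bounded dynamic batching, sketched below with invented names and parameters.

```python
import time

def batch_requests(incoming, max_batch=4, deadline_s=0.05):
    """Collect up to max_batch requests, or stop once deadline_s elapses,
    so no request waits longer than the deadline before being translated."""
    batch, start = [], time.monotonic()
    for item in incoming:
        batch.append(item)
        if len(batch) >= max_batch:
            break  # full batch: send it to the model immediately
        if time.monotonic() - start >= deadline_s:
            break  # deadline hit: send a partial batch rather than wait
    return batch

requests = iter(["sent1", "sent2", "sent3", "sent4", "sent5"])
print(batch_requests(requests, max_batch=4, deadline_s=1.0))
# the first four sentences form one batch; 'sent5' waits for the next one
```

Tuning max_batch against deadline_s is exactly the throughput-versus-latency trade-off a serving system must make.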

Chinese-to-English is one of the more than 10,000 language pairs supported by Google Translate, and in the coming months we will continue to roll GNMT out to many more language pairs.

from "Google Translate Achieves a Major Breakthrough"

As an old MT hand, I could not help being drawn in, and got ready to give this newest neural version of Google Translate a quick try.
I had tried Google's online translation before, and overall it was inferior to Baidu's. But now they say Chinese MT has truly gone neural: deeply neural, approaching human quality. I happen to have hundreds of articles awaiting translation, a perfect test case, so I tried it right away, looking forward to Google's divine translation.

Dong:
@wei I hope it doesn't let you down. I once said, half in jest: rule-based MT is a fool, statistical MT is a madman, and now I'll keep teasing: neural MT is a "con man" (by which I absolutely do not mean its researchers). Language is not like a cat's face or a mug; looking alike on the surface is not enough, the content has to match too!

Me: Now is the moment to witness the miracle:

The automatic speech rendition of this science blog of mine is attached here; it is amazingly clear and understandable. If you are an NLP student, you can listen to it as a lecture from a seasoned NLPer (definitely clearer than if I were giving this lecture myself, with my strong accent). More amazingly, the original blog was in Chinese, and I used the newest Google Translate, claimed to be based on deep learning, using sentence-based translation as well as character-based techniques. My original blog in Chinese is here, so you can compare: 【立委科普:自然语言系统架构简说】.

Teacher Dong, you know my background and my skepticism. But facing progress like this, an automatic translation quality and robustness far beyond the limits we could imagine when we entered the field, we cannot but, cannot but, cannot but be awed.

Dong:
In their own terminology, it is "less adequate, but more fluent." MT has already gone through three paradigm shifts. Once people realized that, whatever happens, it can only ever be a very good information-processing tool and cannot replace human translators, they settled for the cheaper option.

Me:
In any case, this little test left this old MT hand somewhat dumbfounded; I still haven't recovered from the shock. Of course, it so happens that I tested formal prose, on computing and NLP topics, surely well within the coverage of their corpus, so I hit the bull's-eye. Still, compared with the pre-neural Google SMT and Baidu SMT I had used before, this leap of a breakthrough is astonishing. My salute to our neural colleagues: a band of supremely clever madmen.

Mao, this is my feedback on Google's recent claim. Last time I mocked their parser; this time, for the MT breakthrough achieved with the same technology, I express deep admiration. The contrast is not me going neural, or split. In parsing they suffer from having no naturally labeled data, and even the cleverest cook cannot make a meal without rice, so they cannot compete with the symbolic-logic school. MT is different: it has an almost inexhaustible supply of naturally labeled data (human translation has never stopped, leaving behind an ocean of parallel texts).

Mao: @wei So you are now convinced by neural-network MT, and have changed your own views and positions?

Me: I am convinced, but I haven't really changed them.

Mao: How so?

Me:
However strong one's school loyalties, basic facts should be faced. Listen to their translation of my blog listed above: in fluency and in faithfulness to my original, it already beats an average human translator. If an interpreter did not know my field and I lectured from this script, such an average interpreter doing live interpretation would lose to the machine, in both fidelity and fluency (top-tier translators excepted). On this point I have no choice but to concede. On the other hand, what I said before still stands: however deep the neural nets go, I do not see them catching up with my deep parser in the next few years, especially in the ability to handle different domains and genres; that they cannot do. In the natural world there are no labeled syntactic trees, only linear sentences, and every breakthrough seen so far is supervised deep learning, helpless the moment the massive labeled data runs out.
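The asymmetry argued here, that MT has naturally labeled data while parsing does not, comes down to what the world hands you for free. A tiny illustration (the sentence pairs are my own examples):

```python
# Every human translation is already an (input, output) supervision pair:
parallel_corpus = [
    ("机器翻译", "machine translation"),
    ("深度学习", "deep learning"),
]
mt_pairs = [(src, tgt) for src, tgt in parallel_corpus]  # labels for free

# Raw text gives parsing only the input side; the tree (the label) must be
# hand-annotated by linguists into a treebank, which is scarce and costly.
raw_sentences = ["机器翻译", "深度学习"]
parse_pairs = [(s, None) for s in raw_sentences]  # no naturally occurring labels

print(len(mt_pairs), sum(t is None for _, t in parse_pairs))  # -> 2 2
```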

Mao: You've got me confused. Which school are you saying is stronger? @wei Who exactly is the world's number 0?

Me: In parsing, I am number 0; Google cannot catch up. In MT, Google has made a major breakthrough, and I reckon the symbolic-logic school's MT is in for hard times.

Mao: What I am asking is: in MT, who is number 0, whatever the method?

Me: This is not to say rule-based MT systems have no future, but overall, the trend of SMT (statistical MT) gaining the upper hand keeps strengthening.

Yun: THKS. Let me see whether it can translate the company white paper I wrote.

Me:
If you add a little human post-editing, I expect it will come out very well. Don't foolishly commission human translation from scratch anymore. Translation companies that do not use MT as a base will be weeded out; on cost alone they can hardly survive.

Dong:
学习上,初二是一个分水岭,学科数量明显增多,学习方法也有所改变,一些学生能及时调整适应变化,进步很快,由成绩中等上升为优秀。但也有一部分学生存在畏难情绪,将心思用在学习之外,成绩迅速下降,对学习失去兴趣,自暴自弃,从此一蹶不振,这样的同学到了初三往往很难有所突破,中考的失利难以避免。
Learning, the second is a watershed, the number of subjects increased significantly, learning methods have also changed, some students can adjust to adapt to changes in progress, progress quickly, from the middle to rise to outstanding. But there are some students there is fear of hard feelings, the mind used in the study, the rapid decline in performance, loss of interest in learning, self-abandonment, since the devastated, so the students often difficult to break through the third day,

Mao: This translation is nothing special, is it?

Me:
That is exactly the line I was waiting for 🙂 @Mao, a comparison is needed before your question can be answered.

Mao: Then bring out yours and compare?

Me: I quit MT long ago; I am a deserter. Nearly 20 years ago I moved over to information extraction (IE) and sentiment mining. There I have real confidence and fear no comparison.

Liu: Forwarding: How divine is Google's new translation? An English professor tested it himself and tells you...

Me: thanks. The commentary seems fairly even-handed. For spoken language it is certainly still not up to it; its training set has never covered speech well. I tested it before: even some common, simple colloquial expressions came out wrong. I don't know how much that has been strengthened this time.

As for the Google-translated paragraph Teacher Dong gave above, which Mao found unimpressive: having done MT for many years, I know that reaching this level is in fact great progress. Chinese-to-English output used to be unreadable; now it reads as roughly comprehensible. There is a lot of progress in there worth noting.

Liu: @wei Forwarding a comment: So some of what big data does falls under artificial intelligence drill (we can no longer use the word "research")? Wasn't that what people in traditional computer science departments did all along? And the habit of quipping that every linguist you fire moves the system a few steps forward is just shallow.

Ma: In domains with ample data, DL methods have advanced by leaps and bounds these past few years. Several people I know who used to hold prejudices against DL have more or less changed their views. In IR, DL has shown essentially no effect so far, but it is slowly seeping in.

Mao: I don't accept this talk of "traditional computer science departments." CS departments should follow practice, not the other way around.

Dong:
The key to NMT is "resemblance." Hence you get readers who cannot understand the source and think the translation reads perfectly smoothly. A translation without fidelity: isn't that a con man? How would it ever know its own translation is completely off? That is NMT's fatal weakness.

Ma: Teacher Dong, I think all statistical methods share that fatal weakness.

Me:
Every approach has its strengths and its shortcomings; no surprise there. Today I have already listened to three of these translations of my blogs, sighing at every step. Damn it, how can it be so fluent? Fault-finding, spotting mistranslations: always possible. But humans mistranslate too. On comprehensibility and fluency, I am simply convinced. And this is happening between two languages with no family relationship.

Dong:
Back in the day, a senior official said to me: "Actually it doesn't matter that machine translation is only 50% correct. The question is whether you can tell me which half, so I can hire people to translate just that part." I answered that I could not. I have been watching that problem ever since. And now plenty of people are shouting that human translators will be replaced. That is rather like saying that with McDonald's we no longer need French haute cuisine, and MT has not even reached McDonald's level. Computers, and machine translation with them, are toys God gave mankind to play with; God did not grant mankind the ability to replicate itself.

Hong:

My view is simple:
A shadow cannot turn three-dimensional.
Unless a person were squashed flat into two dimensions,
He need not sigh, before his shadow, that he falls short.

Artificial intelligence is like a shadow,
Hoarding data as it follows people around.
Deep learning builds its models:
Fun to watch, like a shadow-puppet play.

Dong:
Indeed. I once compared a dozen or so translations of English classics and found one in which the translator had clearly skipped whole passages on purpose, passages full of flowers and plants; presumably the master could not be bothered to look them up, so he simply left them out.

Why did GNMT choose Chinese-to-English as its first language pair, rather than English-to-Chinese? Very shrewd. Even when a human translation is wrong or has omissions, it usually reads fluently; at the very least it is never dumb-and-crazy or tongue-twisting the way traditional MT is. And that is precisely NMT's trait: it picks whatever target text is most similar. So the broad English readership, mostly unable to read Chinese, is easily "bluffed" by it.

Me:
Right. On closer inspection, this "breakthrough" has fluency to spare but fidelity to seek: an overcorrection.
But everything is only beginning. I can understand the delight of the NMT people facing their breakthrough.

Hong:

Master Wei has long played at NLP,
Ever proud and aloof, head never bowed.
Today he professes awe at a miracle:
To deep neural nets he stands converted!

Me:
Converted is going too far, nor would I qualify. The admiration is heartfelt, and I hope there will be chances to cooperate, trade strengths for weaknesses, and win together. If they look down on us, we will go it alone. Deep parsing is the crown of NLP. The day neural parsing surpasses yours truly across the board, I will retire. By that standard, as things now stand, I reckon I will never get to retire in this lifetime. I hope I am wrong, so I can tour the world early.

 

【Related】

Wei’s Introduction to NLP Architecture

Google Translate Achieves a Major Breakthrough

How divine is Google's new translation? An English professor tested it himself and tells you...

【Wei's Science Blog: the NLP Overview Diagram】 (sister post)

Machine Translation

Wei's Introduction to NLP Architecture Translated by Google

Introduction to NLP Architecture
by Dr. Wei Li
(fully automatically translated by Google Translate)

The automatic speech rendition of this science blog of mine is attached here; it is amazingly clear and understandable. If you are an NLP student, you can listen to it as a lecture from a seasoned NLPer (definitely clearer than if I were giving this lecture myself, with my strong accent):

To preserve the original translation, nothing is edited below.  I will write another blog later to post-edit it into an "official" NLP architecture introduction, reviewed and endorsed by myself as the original writer.  But for the time being, it is completely unedited, thanks to the newly launched Google Translate service from Chinese into English at https://translate.google.com/ 

[Legislature science: natural language system architecture brief]

For the natural language processing (NLP) and its application, the system architecture is the core issue, I blog [the legislature of science: NLP contact diagram] which gave four NLP system architecture diagram, now one by one to be a brief .
I put the NLP system from the core engine to the application, is divided into four stages, corresponding to the four frame diagram. At the bottom of the core is deep parsing, is the natural language of the bottom-up layer of automatic analyzer, this work is the most difficult, but it is the vast majority of NLP system based technology.

160213sg5p2r8ro18v17z8

The purpose of parsing is to structure unstructured languages. The face of the ever-changing language, only structured, and patterns can be easily seized, the information we go to extract semantics to solve. This principle began to be the consensus of (linguistics) when Chomsky proposed the transition from superficial structure to deep structure after the linguistic revolution of 1957. A tree is not only the arcs that express syntactic relationships, but also the nodes of words or phrases that carry various information. Although the importance of the tree, but generally can not directly support the product, it is only the internal expression of the system, as a language analysis and understanding of the carrier and semantic landing for the application of the core support.

160216n8x8jj08qj2y1a8y

The next layer is the extraction layer (extraction), as shown above. Its input is the tree, the output is filled in the content of the templates, similar to fill in the form: is the information needed for the application, pre-defined a table out, so that the extraction system to fill in the blank, the statement related words or phrases caught out Sent to the table in the pre-defined columns (fields) to go. This layer has gone from the original domain-independent parser into the face-to-face, application-oriented and product-demanding tasks.
It is worth emphasizing that the extraction layer is domain-oriented semantic focus, while the previous analysis layer is domain-independent. Therefore, a good framework is to do a very thorough analysis of logic, in order to reduce the burden of extraction. In the depth analysis of the logical semantic structure to do the extraction, a rule is equivalent to the extraction of thousands of surface rules of language. This creates the conditions for the transfer of the domain.
There are two types of extraction, one is the traditional information extraction (IE), the extraction of fact or objective information: the relationship between entities, entities involved in different entities, such as events, can answer who dis what when and where When and where to do what) and the like. This extraction of objective information is the core technology and foundation of the knowledge graph which can not be renewed nowadays. After completion of IE, the next layer of information fusion (IF) can be used to construct the knowledge map. Another type of extraction is about subjective information, public opinion mining is based on this kind of extraction. What I have done over the past five years is this piece of fine line of public opinion to extract (not just praise classification, but also to explore the reasons behind the public opinion to provide the basis for decision-making). This is one of the hardest tasks in NLP, much more difficult than IE in objective information. Extracted information is usually stored in a database. This provides fragmentation information for the underlying excavation layer.
Many people confuse information extraction and text mining, but in fact this is two levels of the task. Extraction is the face of a language tree, from a sentence inside to find the information you want. The mining face is a corpus, or data source as a whole, from the language of the forest inside the excavation of statistical value information. In the information age, the biggest challenge we face is information overload, we have no way to exhaust the information ocean, therefore, must use the computer to dig out the information from the ocean of critical intelligence to meet different applications. Therefore, mining rely on natural statistics, there is no statistics, the information is still out of the chaos of the debris, there is a lot of redundancy, mining can integrate them.

160215hzp5hq5pfd1alldj

Many systems do not dig deep, but simply to express the information needs of the query as an entrance, real-time (real time) to extract the relevant information from the fragmentation of the database, the top n results simply combined, and then provide products and user. This is actually a mining, but is a way to achieve a simple search mining directly support the application.
In fact, in order to do a good job of mining, there are a lot of work to do, not only can improve the quality of existing information. Moreover, in-depth, you can also tap the hidden information, that is not explicitly expressed in the metadata information, such as the causal relationship between information found, or other statistical trends. This type of mining was first done in traditional data mining because the traditional mining was aimed at structural data such as transaction records, making it easy to mine implicit associations (eg, people who buy diapers often buy beer , The original is the father of the new people's usual behavior, such information can be excavated to optimize the display and sale of goods). Nowadays, natural language is also structured to extract fragments of intelligence in the database, of course, can also do implicit association intelligence mining to enhance the value of intelligence.
The fourth architectural diagram is the NLP application layer. In this layer, analysis, extraction, mining out of the various information can support different NLP products and services. From the Q & A system to the dynamic mapping of the knowledge map (Google search search star has been able to see this application), from automatic polling to customer intelligence, from intelligent assistants to automatic digest and so on.

16221285l5wkx8t5ffi8a9

This is my overall understanding of the basic architecture of NLP. Based on nearly 20 years in the industry to do NLP product experience. 18 years ago, I was using a NLP structure diagram to the first venture to flicker, investors themselves told us that this is million dollar slide. Today's explanation is to extend from that map to expand from.
Days unchanged Road is also unchanged.

Where previously mentioned the million-dollar slide story. Clinton said that during the reign of 2000, the United States to a great leap forward in Internet technology, known as. Com bubble, a time of hot money rolling, all kinds of Internet startups are sprang up. In such a situation, the boss decided to hot to find venture capital, told me to achieve our prototype of the language system to do an introduction. I then draw the following three-tier structure of a NLP system diagram, the bottom is the parser, from shallow to deep, the middle is built on parsing based on information extraction, the top of the main categories are several types of applications, including Q & A system. Connection applications and the following two language processing is the database, used to store the results of information extraction, these results can be applied at any time to provide information. This architecture has not changed much since I made it 15 years ago, although the details and icons have been rewritten no less than 100 times. The architecture diagram in this article is about one of the first 20 editions. Off the core engine (background), does not include the application (front). Saying that early in the morning by my boss sent to Wall Street angel investors, by noon to get his reply, said he was very interested. Less than two weeks, we got the first $ 1 million angel investment check. Investors say that this is a million dollar slide, which not only shows the threshold of technology, but also shows the great potential of the technology.

165325a3pamcdcdr3daapw

Pre - Knowledge Mapping: The Structure of Information Extraction Engine

【Related】
[Legislature science: NLP contact map (one)]
Pre - Knowledge Mapping: The Architecture of Information Extraction Engine
[Legislature science: natural language parsers is to reveal the mystery of the language LIGO-type detector]
【Essay contest: a dream come true
"OVERVIEW OF NATURAL LANGUAGE PROCESSING"

"NLP White Paper: Overview of Our NLP Core Engine"

White Paper of NLP Engine

"Zhaohua afternoon pick up" directory

[Top: Legislative Science Network blog NLP blog at a glance (regularly updated version)]

nmt1

nmt2

nmt3

nmt4

nmt5

nmt6

nmt7

retrieved 10/1/2016 from https://translate.google.com/

translated from http://blog.sciencenet.cn/blog-362400-981742.html