Unified Models Surpass Single-modal Models  (Gemini Notes 2/8)

by Zhi-Fei Li, Gao Jia, Wei Li, from "Brother Fei on AI"


Multi-modal Large Unified Models Finally Surpass Specific Single-modal Models  

Humans perceive, think, and develop emotions and consciousness by integrating multiple senses. Gemini follows the same approach: it takes inputs in multiple modalities, integrates them in its "brain," and expresses itself through outputs in multiple modalities. This comprehensive "simulation" of human intelligence by such models is evolving rapidly.

Previously, multi-modal model training resembled a system composed of separate eyes, ears, arms, and brains, lacking strong coordination. However, the direction represented by Gemini feels significantly different: it's as if the large model has become a complete digital person, where hands, eyes, brain, and mouth work in harmonious silicon unity. Gemini is the first true end-to-end multi-modal system.

In the past, models optimized for a single modality usually outperformed those handling multiple modalities simultaneously. The common practice was single-modality model training. Even GPT-4 primarily "concatenates" different modalities into an overarching framework, rather than being a unified multi-modal model.

The exciting aspect of Gemini is that it was designed from the start as a native multi-modal architecture. The training process interweaves various modal data from the beginning. If previous large models were like attaching sensory organs or mechanical arms to a brain externally, Gemini is like growing its own eyes, ears, and arms internally, allowing for fluid and natural interaction.

Whether in terms of model architecture, training process, or final output, Gemini achieves a seamlessly integrated multi-modal experience.

For the first time, Gemini demonstrates that a unified model can handle all modalities and perform even better than models focused on a single modality! For example, compared with Whisper, a model optimized for speech recognition, Gemini shows a significant improvement in accuracy.

This signifies the dawn of the era of unified multi-modal models.


In fact, Gemini is not the first model to demonstrate that different modalities can mutually enhance performance. This was also evident in PaLM-E, where "PaLM-E, trained across different domains including general vision-language tasks at internet scale, showed a marked improvement in performance compared to models performing single tasks in robotics."

Another example of modalities enhancing each other is the multilingual ability of large language models. If we consider different languages as distinct "modalities," the practice of large language models has shown that processing native data from all languages together (through tokenization and embedding) leads to the successful construction of a Tower of Babel for human languages.

The overwhelming amount of English data in the training of large language models also benefits the model's understanding and generation of languages with limited data, reaffirming the transfer of linguistic knowledge. It's akin to a person skilled in tennis also being able to improve their abilities in squash or golf through related skills.

Since the rise of large models in February this year, many have gradually come to believe that "unified multi-modal models will surpass single-modality models." However, this belief had not been confirmed at scale until Google's Gemini demonstrated its promise, reshaping and solidifying it for many.

In the future, specialized models for tasks like speech recognition or machine translation may become less significant. Many generative tasks such as TTS and image generation are also likely to be unified under large models. Some may complain about the high cost and slow speed of large unified models, but these are purely technical challenges. In practice, we can distill unified models down to specific modalities or scenarios.

We firmly believe that unified cross-modal large models will become the mainstream pathway to achieving AGI.

Furthermore, "modalities" are not just sound, images, videos, etc. Olfactory, gustatory, tactile, temperature, and humidity sensors are also different modalities for gathering environmental information, all of which can in time be encompassed by unified models.

Ultimately, various modalities are merely carriers of "information." They are a form of rendering, a presentation style, a means for an intelligent entity to interact with the physical world. In the eyes of a unified model, all modalities internally can be represented by unified multi-dimensional vectors, enabling cross-modal knowledge transfer and the intersection, alignment, fusion, and reasoning of information.
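
As a purely illustrative sketch of the idea that every modality can be mapped into one shared vector space, the toy Python below projects hand-made text and image features into a common two-dimensional space and compares them by cosine similarity. The feature layouts, the projection matrices (`TEXT_TO_SHARED`, `IMAGE_TO_SHARED`), and all numbers are invented for illustration; nothing here reflects how Gemini is actually built.

```python
import math

def project(features, weights):
    """Linearly map modality-specific features into the shared space."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical text features: [mentions "duck", mentions "car"].
TEXT_TO_SHARED = [[1.0, 0.0],
                  [0.0, 1.0]]

# Hypothetical image features: [yellowness, beak-like shape, wheel-like shape].
IMAGE_TO_SHARED = [[0.5, 0.5, 0.0],
                   [0.0, 0.0, 1.0]]

# One text input and two image inputs, all landing in the same 2-D space.
text_duck = project([1.0, 0.0], TEXT_TO_SHARED)
image_duck = project([0.8, 0.9, 0.1], IMAGE_TO_SHARED)
image_car = project([0.1, 0.0, 0.9], IMAGE_TO_SHARED)

# Once everything lives in one space, cross-modal comparison is plain geometry.
duck_match = cosine(text_duck, image_duck)
car_match = cosine(text_duck, image_car)
print(duck_match > car_match)
```

In a real system the projections would be learned jointly rather than written by hand; the sketch only shows why a shared space makes alignment across modalities a matter of comparing vectors.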

When the barriers between modalities are breached, revealing the core beneath various renderings, we see the origin of cognition — language.




(Gemini Notes Series to be continued)


Original from:

Eight Insights on Google Gemini (关于 Google Gemini 的八点启示)

by Zhi-Fei Li, Gao Jia, Wei Li, from "Brother Fei on AI"


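In our old friends' group chat, a female classmate replayed several great choral anthems of that era, "Tomorrow Will Be Better," "Let the World Be Filled with Love," and "We Are the World." Facing the close of a 2023 filled with war and turmoil, she sighed: What has happened to today's world? Replaying these songs from the International Year of Peace, sorrow wells up, and one wants to weep but has no tears.

In fact, our parents' generation also had high-spirited, beautiful dreams in the first few years after Liberation, and "Long Live Youth" left a record of them. Only the unending political campaigns that followed cast a shadow over that rose-colored picture. After the ten-year catastrophe of the Cultural Revolution, Deng Xiaoping gave us the chance to attend university and graduate school. Everything was waiting to be rebuilt, a hundred flowers bloomed, society was full of vitality, and we were full of hope and a sense of responsibility. It was a historic opportunity, and also a beautiful encounter.

Love needs no reason, but AI must not run wild without reason.

-- Even though Musk, on his visit to China, actually learned to link love (爱, "ai") with AI.

-- Even though Ilya claims he wants to instill in models a heart that loves humanity.

-- Even though each of us, following habit or nature, still longs for simple love, what we face is the bizarre spectacle of a chaotic age: information cocoons, truth indistinguishable from falsehood, fast-food culture, one quick thrill and then the end. There seems to be no tomorrow and no hope.

O steed of humanity in 2024, and not just AI: can you run a little slower, a little steadier, and run with compassion and a human heart?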



Cross-modal Knowledge Transfer of Large Models Proven (Gemini Notes 1/8)

by Zhi-Fei Li, Gao Jia, Wei Li, from "Brother Fei on AI"


In 1948, inspired by psychiatric patients, British doctor Ross Ashby invented a peculiar machine called the "Homeostat." He proclaimed that this device, costing about 50 pounds, was "the closest thing to an artificial brain ever designed by mankind." The Homeostat was built on four bomb-control switchgear units that the British Royal Air Force had used during World War II. On top of these sat four cubic aluminum boxes, whose only visible moving parts were four small magnetic needles, each swaying like a compass needle in a small trough of water.

When the machine was activated, the needles moved in response to the electric current from the aluminum boxes. The four magnetic needles were always in a sensitive and fragile state of balance. The sole purpose of the Homeostat was to keep the needles centered, maintaining a "comfortable" state for the machine.

Ashby experimented with various methods to make the machine "uncomfortable," such as reversing the polarity of the electrical connections or the direction of the needles. However, the machine always found ways to adapt to the new state and re-center the needles. Ashby described the machine as "actively" resisting any disturbances to its balance through synaptic action, performing "coordinated activities" to regain equilibrium.
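
The adaptive behavior described above can be imitated in a few lines of Python. This is a loose, hypothetical sketch, not a model of Ashby's actual circuit: it assumes a single needle, a fixed table of candidate feedback weights (`CANDIDATE_WEIGHTS`) standing in for the machine's switches, and a rule that steps to the next weight whenever the needle hits its limit.

```python
# Hypothetical feedback weights the machine can step through (assumed values).
CANDIDATE_WEIGHTS = [0.6, -0.5, 0.3, -0.8]

def run_homeostat(steps, flip_at=None, limit=1.0):
    """Simulate one needle; return (final position, number of reselections)."""
    index = 0            # which candidate weight is currently wired in
    polarity = 1.0       # the experimenter can reverse this
    needle = 0.5         # start off-center
    reselections = 0
    for t in range(steps):
        if t == flip_at:
            polarity = -polarity  # disturbance: reverse the connection
        # Feedback: the needle is pushed in proportion to its own deflection.
        needle += polarity * CANDIDATE_WEIGHTS[index] * needle
        if abs(needle) > limit:
            # "Uncomfortable": clamp the needle and try the next weight.
            needle = max(-limit, min(limit, needle))
            index = (index + 1) % len(CANDIDATE_WEIGHTS)
            reselections += 1
    return needle, reselections

undisturbed = run_homeostat(steps=100)
disturbed = run_homeostat(steps=100, flip_at=20)
print(undisturbed[1], disturbed[1])
```

In this toy run, reversing the polarity makes the previously stable setting unstable; the needle drifts to its limit, the machine steps to another weight, and the needle settles back toward the center.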

Ashby believed that one day, such a "primitive device" could evolve into an artificial brain more powerful than any human, capable of solving the world's most complex and challenging problems.

Ashby knew nothing of today's AGI evolution, and using four small magnetic needles as sensors for intelligence may seem laughable, yet his Homeostat fundamentally challenged everyone's understanding of "intelligence": isn't intelligence the ability to absorb information from the environment in various modalities, and to modify behavior and responses based on feedback?

Seventy-five years after the peculiar "Homeostat," Google's Gemini, which claims to surpass human ability on multi-modal tasks, is racing to catch up with billions of years of carbon-based evolution, fueled by natively multi-modal big data.

Machine intelligence is evolving far faster than we imagined. A year ago, OpenAI upended Google's long-established position in AI with its 'brute force aesthetic,' building a Tower of Babel of human languages. A year later, Google countered with Gemini, taking a 'fight fire with fire' approach to build the first unified cross-modal model and set another milestone in AGI evolution.

Despite initial skepticism over exaggerated video demos upon Gemini's release, it's undeniable that the dawn of a unified multi-modal approach is shining. What capabilities does Gemini confirm? How will Google's wheels of fate turn? Is time a friend to OpenAI or Google? What does multi-modality mean for Agents and embodied intelligence? Are the foundations for the emergence of AGI with consciousness already in place? How should we view the implications of Gemini for the AI future?


Cross-modal Knowledge Transfer of Large Models Proven Again

For humans, the ability to transfer knowledge across domains and across time and space matters more than merely learning skills. If machines can master cross-modal knowledge transfer, they edge closer to general intelligence.

In July this year, Google introduced RT-2, a robotic system based on large models, sparking hope for general-purpose robots. Leveraging the "common sense" of language models, the system's robotic arm demonstrated the ability to "pick up an extinct animal from a table," moving from common-sense reasoning to robotic execution and showcasing cross-modal knowledge transfer.

In December, Google's introduction of Gemini reaffirmed the cross-modal knowledge transfer capability of large models: the "common sense" of language models can be transferred to the training of non-linguistic modalities added later. Language models are known to form the foundation of cognitive intelligence, and the most basic form of cognitive intelligence is "common sense." Without common sense, the practical application of large multi-modal models would be challenging. Gemini smoothly transfers this "common sense" to downstream multi-modal tasks. Like RT-2, it achieves cross-modal integration through the transfer of text-derived knowledge: Gemini can connect ontological concepts to the understanding of auditory and visual objects, and eventually link them with action, forming an intelligent system ready for real-world application.

From the perspective of model training, whereas language models are trained on massive internet data, downstream models (such as robotic models) can be trained with very limited data through knowledge transfer. This transfer-based training addresses the long-standing problem of data scarcity in downstream applications. For instance, to achieve the effects shown in the demo video (which raised doubts about whether Gemini was comprehending video or still pictures, though that does not affect the discussion of cross-modal knowledge transfer here), Gemini first needs some ontological knowledge: it understands the concept of a duck, knows the usual color of ducks, and knows what blue is. When it sees a "blue duck," it reacts much as a human would, expressing the "common sense" that "blue ducks are uncommon."

Through auditory and visual perception, Gemini identifies that the blue duck is made of rubber and knows that rubber is less dense than water. Based on this common sense and reasoning, when it hears a squeaking sound, it can predict that "the blue duck can float on water."

From RT-2 to Gemini, we have moved to the "fusion" of multi-modal perceptual intelligence and cognitive intelligence. We have transitioned from isolated "five senses" modules (eyes, ears, mouth, nose, and body) to a unified digital "human."

Doesn't this imply that, on the path to simulating human intelligence, the unified model is the right approach?




(Gemini Notes Series to be continued)


Original from:

Eight Insights on Google Gemini (关于 Google Gemini 的八点启示)

by Zhi-Fei Li, Gao Jia, Wei Li, from "Brother Fei on AI"


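Authors | Gao Jia, Wei Li
Concept | Zhi-Fei Li

In both RT-2 and Gemini, language-based cognitive intelligence remains the core of simulating human knowledge, with the knowledge transfer of common sense and its reasoning playing a key role. In RT-2, for example, the data volume and parameter scale devoted to the language modality far exceed those of the downstream image and action modalities.
Achieving this highlights the greatest contribution of language models to AGI, because it truly embodies researchers' original intent and positioning for large language models: to serve as the Foundation Model and Core Engine.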





Tanya's Childhood 2: American nursery rhymes

This is a nostalgic recounting by a parent reminiscing about his daughter's childhood, focusing on various American nursery rhymes and the daughter's playful interactions. The parent reflects on the few recordings he has of his daughter from when she was young, which were transferred from an iPod to an iPhone and often played in the car, blending with music into fond memories of the past.

The daughter is described as a talkative and somewhat rapid-fire speaker as a child, who enjoyed showing off nursery rhymes.


April 13, 2019

Video on Weibo (立委_米拉) or YouTube:

As I navigate through the cherry blossom season, I'm engulfed in a wave of nostalgia, reflecting on the fleeting moments of my daughter's childhood. It's remarkable how certain memories, like her voice from those few recordings we made, have ingrained themselves in my heart. These snippets, once captured in an iPod and now residing in my iPhone, have become an auditory pathway back to those treasured times.

My daughter was always a chatterbox, her words often racing ahead of her thoughts. She had a particular fondness for American nursery rhymes, relishing in their playful rhythms and catchy phrases. I fondly recall how she would eagerly recite them, her voice filled with the enthusiasm of youth.


One of her favorite rhymes was a humorous jibe at boys:

"Boys go to Jupiter to get more stupider, girls go to college to get more knowledge."

She'd recite it with such dramatic flair, emphasizing each word, as if imparting some profound wisdom. Her rendition was always animated, almost rap-like, making it impossible not to smile.

“What do you want me to say now? Boys go to Jupiter, do you know the planet Jupiter? They go to the planet Jupiter, once they get there, they get stupider and stupider every second. And girls they go to college to get more knowledge and knowledge into their brain on their head.”

"Eeny, meeny, miny, moe" was another staple in her repertoire.

"Eeny, meeny, miny, moe,
Catch a tiger by the toe.
If he hollers, let it go,
Eeny, meeny, miny, moe."

It's fascinating to think about how this simple rhyme was more than just a game; it was a glimpse into the crafty minds of children. They'd use it to make choices, but often the outcome was already decided in their hearts. They'd manipulate the ending to suit their desired choice, either accepting or rejecting it with a claim like:

"My mother told me to pick the very best one, and you are not it."

Or, “My mother says to pick the very best one, and that is YOU”.

Among these recordings was a playful, teasing rhyme that still brings a chuckle:

“You know what
Kick your butt
All the way to Pizza Hut

While you’re there,
Comb your hair
Don’t forget your underwear!”

This rhyme, intertwined with stories of school and friendships, showcased the innocent yet intricate world of children's social dynamics.

“I said that I am the Princess of Jewelry because one of my friends and buddy said that she looked at my jewelry I brought to school. What happened is she was so surprised and she loved it … she said that I am Princess of Jewelry and she is the Queen of Makeup. Next time I am going to bring new jewelry, she said that I am the Queen of Jewelry … No, Daddy, Jessica said I am the Queen of Jewelry if I bring some new jewelry tomorrow.”

A particularly memorable story was about Tanya proclaiming herself the "Princess of Jewelry" after a school friend complimented her on her collection. This interaction with her friend, Jessica, who crowned herself the "Queen of Makeup," was a brilliant display of childhood diplomacy and innocence.

Tanya's excitement at the thought of being elevated to the "Queen of Jewelry" the next day if she brought new jewelry to school was both touching and amusing.

Listening to these recordings also brought into stark relief the difference between a native language and a second language. Her English, fluid and expressive, stood in contrast to her Mandarin, which, despite her efforts at weekend Chinese school, sounded labored and less natural.

These memories, encapsulated in a few precious recordings, remind me of how quickly time passes. They're not just echoes of Tanya's childhood but also emblems of a period that seems both distant and vividly close. In the beauty of the cherry blossoms, I find a reflection of those bygone days, a tender reminder of the passage of time.




Tanya's Childhood 1: McDonald's


Parenting is one of the most memorable experiences in life. The growing up moments of children often bring surprises and accumulate into warmth and affection. Many delightful father-daughter conversations are casually recorded, while others are lost with the wind. Life's journey is full of suspense, and it's our belief in our daughters that supports us through the lows, urging us not to stop moving forward.

Forever McDonald's

During our time in Buffalo, every weekend the "big and small bosses" (my wife and daughter) would shop tirelessly at the Factory Outlets, playing the game of buying, returning, and buying again with the merchants. As usual, I would find a nearby McDonald's, open my Apple laptop, and, over fries and coffee, half-heartedly browse the internet to see what jaw-dropping new things my bored old friends had come up with, all the while thinking about how to write this long-brewing piece, "Long Live McDonald's." Fortunately, there is nothing new under the sun: just a bunch of science nerds arguing over the question from "One Hundred Thousand Whys" of why planes can fly, still without a conclusion after several months. Nonsense. If planes couldn't fly, would they still be called planes? Better to answer first why birds can fly; planes are simply humanity's great rocs.

Back to McDonald's. No matter how nutritionists call for a crackdown on so-called junk food, and no matter how patriots clamor to resist the invasion of the Western fast-food king, McDonald's shines golden in my heart, warm and cozy, like home. The good feeling McDonald's gives me does not come from its being cheap and fresh (fresh as in freshly made, of course, not fresh as in delicious; it is foreign food, after all). The fries and chicken nuggets are not bad, and the burgers go down when you are really hungry, a bit better than those ice-cold sandwiches. Nor is the charm of McDonald's only its clean, bright environment and its approachable, teahouse-like atmosphere. What truly makes McDonald's everlasting is its Happy Meal and the accompanying children's play area (Ronald's Playhouse). Happy Meals brought my daughter countless surprises and joys in her childhood, and Ronald's Playhouse witnessed countless wonderful, happy hours that she and I spent together.

My first impression of McDonald's was formed during a trip to Europe before 2015. A group of penniless students decided to tour the countries of Europe together. Travel is not the exclusive privilege of the wealthy; students have their own ways. A good-value rail pass gave us the basics for seeing all of Europe: of food, lodging, and transport, it covered the latter two. Generally we went sightseeing by day and rode the train and slept by night. Sometimes we passed through several countries in a single night, had our passports and visas checked while half asleep, then tilted our heads and dozed off again. On waking, we got off at whatever tourist spot we had bumped into for a quick look around. If we had missed some famous city or sight, we could simply turn around and catch a train back. Going wherever chance took us felt quite free. This style of travel was very popular among students, all the more so among Chinese students abroad, who were thrifty to the point of harshness. Apart from train tickets and admission tickets (with student discounts), the only expense was food. Appetites are especially good on the road and stomachs often complained, but restaurants and even snacks at tourist spots were pricey; only McDonald's prices were relatively stable. A fellow student summed it up: "Believe me, after touring all of Europe there is only one unbreakable truth: McDonald's is the only place where you can afford to eat and actually get full." Food is people's first necessity, and McDonald's assembly-line operation and low-margin, high-volume sales secured its position as the ruler of the restaurant industry.

My close and frequent contact with McDonald's, though, came because of my daughter, Tanya. Toys are angels to children, and Tanya was keen on tracking every toy released with McDonald's Happy Meals. When she came across a theme she liked, such as Furby or Teletubby, she was not satisfied until she had collected the full set in every color and design. For that reason I ate quite a few Happy Meals myself, just to complete the collection as quickly as possible. Once, I ate Happy Meals for lunch for a whole week. Tanya found it odd: "Dad, are you ok? Did you tell me you don't really like the McDonald's food?" I smiled and said, "It's not bad, actually I seem to like it. Important thing is, we got the toy." Later Tanya finally figured it out and told her friends, "I can't believe it. My Dad ate Happy Meals nonstop just to get a complete collection of my favorite toys." Her tone carried the contentment of being doted on.

Ronald's Playhouse at McDonald's

During our years in Buffalo, the play area attached to McDonald's was the place we visited most often. There was food, drink, and a maze, and there were always other children; Tanya would not go home until she was completely exhausted. The McDonald's maze, with its endless twists and turns and passages connecting up, down, left, and right, was the children's favorite. Tanya was timid by nature, and for a long time she could only gaze at the maze and sigh. One day we noticed that the posted rules for the maze said: And parents, too! So parents were allowed to go in and play with their children. I crawled with Tanya into the long, narrow, tube-shaped passage, and she was thrilled beyond words; from then on there was no stopping her. Pity my old bones, hunched over and crawling along inside with a bunch of kids while many parents looked on and laughed. Whenever a child started crying in the maze, I was asked to lead the child out.


When traveling as a family, we'd often search for the next McDonald's on endless highways, especially when night fell and hunger struck. The golden neon 'M' sign stood tall and inviting, always offering a warm and casual welcome.

Forever McDonald's!


Written on Mother's Day 2007.