Understanding the Forward Deployed Engineer (FDE) Model for AI Startups

English Podcast

中文版本

最近,Y Combinator 请来了 Bob McGrew ——前 OpenAI 首席研究官,同时也是 PayPal 和 Palantir 的资深技术骨干。令人意外的是,在场的创业者们并没有追问他“如何打造下一个 GPT”,反而一窝蜂地想知道:Palantir 的 FDE 模式究竟是怎么运作的?Bob 也坦言,过去一年里,他为无数创业公司提供过咨询,几乎所有人都在痴迷研究这种模式如何真正落地。

什么是 FDE?

FDE(Forward Deployed Engineer,前线部署工程师) 的核心理念,是把工程师直接派驻到客户一线,负责打通“理想产品”与“真实需求”之间的鸿沟。这一思路最早源于 Palantir 服务美国情报机构的岁月。那时客户的挑战极其复杂、没有任何现成模板,只能“现场拼凑”解决方案。起初,很多人认为这种模式无法规模化、太过劳动密集,不符合标准化的 SaaS 理念。可如今,正在探索 AI Agent 与企业级落地的创业公司们,却纷纷把它奉为圭臬。

它是如何运作的

Palantir 把 FDE 团队拆分为两类角色:

  • Echo:行业洞察者,深入客户工作流程,挖掘核心痛点,敢于质疑现状。
  • Delta:技术实干家,能够在现场快速迭代,把想法变成可运行的原型。

与此同时,总部的 核心产品团队 则把这些前线临时拼凑的“碎石路”经验,沉淀为真正的平台功能——就像把碎石铺成的便道逐步升级为可复用的高速公路。

为什么它重要

FDE 模式最大的优势,是能和客户建立极深的合作关系,发现那些任何调研或问卷都无法揭示的真实需求。执行得好,它能形成强大的护城河。但风险同样存在。如果缺乏纪律,FDE 很容易沦为传统咨询或外包。判断是否健康的关键在于:核心产品是否在持续进化?交付效率是否在不断提高?如果只是人海战术的项目交付,那就南辕北辙了。

与咨询的本质区别

关键差异在于:

  • 咨询 只解决一次性问题。
  • FDE 则要求把一线的经验和解决方案反馈到平台中,让产品每服务一个客户就更强大一分。

这种反馈闭环,以及产品经理把定制需求抽象为通用功能的能力,才是 FDE 的真正精髓。

为什么 AI 创业公司都在效仿

对 AI Agent 公司而言,市场过于碎片化和不确定,不存在“通吃型”产品。深度嵌入客户现场,不是可选项,而是唯一的探索路径。唯有如此,才能找到真正的产品形态和市场契合点。

商业模式的变化

传统 SaaS 依赖订阅规模化,而 FDE 合同更偏向结果导向与灵活定价。这里的关键杠杆是 产品杠杆:同样的前线投入,能否带来更大的合同规模,同时不断降低下一次定制的边际成本。

更大的图景

FDE 的流行揭示了现代科技公司的一个悖论:规模化的公司,往往要坚持做那些“无法规模化的事”。AI 的能力正在爆发,但距离真正落地仍有巨大鸿沟。而正是在这个鸿沟里,蕴藏着当下创业公司最大的机会。这不是一条轻松的道路,更像是长期的阵地战,而非一蹴而就的闪电战。但对创业者来说,它或许是唯一可行的道路。

【人工智能】什么是FDE?为何在硅谷爆火? | 前线部署工程师 | Bob McGrew | Palantir | 历史成因 | PMF | 总部产品平台 | Echo&Delta团队 | 历史倒退?


Recently, Y Combinator hosted Bob McGrew, the former Chief Research Officer at OpenAI and a veteran technologist from PayPal and Palantir. What surprised many was the line of questioning. Instead of asking him how to build the next GPT, founders kept pressing him on a very different topic: Palantir’s FDE model.

Bob admitted that over the past year, nearly every startup he’s advised has been obsessed with learning how this model works in practice.

What Exactly Is FDE?

FDE (Forward Deployed Engineer) is a model where engineers embed directly with customers to bridge the gap between what the product aspires to be and what the customer actually needs.

The idea traces back to Palantir’s early days working with U.S. intelligence agencies. The challenges were messy, complex, and had no off-the-shelf solutions. The only way forward was to “build on the ground” with the client. At the time, many dismissed it as unscalable, labor-intensive, and far from the clean SaaS ideal. Fast forward to today, and the very same approach is being embraced by AI startups building agents and enterprise solutions.

How It Works

Palantir structured its FDE teams around two roles:

  • Echo: the industry-savvy operator who lives inside the customer’s workflow, identifies core pain points, and challenges the status quo.
  • Delta: the technical builder who can spin up prototypes quickly, solving problems in real time.

Meanwhile, the core product team back at HQ takes these frontline hacks and turns them into platform features. Think of it as paving a permanent road where the FDEs first laid down gravel.

Why It Matters

The strength of the FDE model is that it forges unusually deep relationships with customers. It surfaces real market demand—things no survey or user interview could ever uncover. Done right, it creates a defensible moat.

But it’s also risky. Without discipline, FDE can collapse into traditional consulting or body-shop outsourcing. The litmus test of a healthy model is whether the core platform keeps evolving, making each new deployment faster, cheaper, and more scalable.

Different from Consulting

The distinction is critical:

  • Consulting delivers one-off solutions.
  • FDE is about feeding learnings back into the product, so the platform gets stronger with every customer.

This feedback loop—and the ability of product managers to abstract from bespoke requests—is what turns customer-specific fixes into reusable product capabilities.

Why AI Startups Love It

For AI Agent companies, the market is far too fragmented and unpredictable for a “one-size-fits-all” solution. No universal product exists. Embedding deeply with customers isn’t optional—it’s the only way to figure out what works, discover product-market fit, and build enduring platforms.

A Shift in Business Models

Unlike traditional SaaS, which scales on pure subscriptions, FDE contracts are more outcome-driven and flexible. The key lever is product leverage: doing the same amount of frontline work but translating it into larger contracts and less marginal customization over time.

The Bigger Picture

The rise of FDE highlights a paradox of modern tech: at scale, the best companies keep doing the things that “don’t scale.” The gulf between breakthrough AI capabilities and messy, real-world adoption is exactly where the biggest opportunities lie today.

It’s not an easy path—more trench warfare than blitzscaling—but for founders, it may be the only one that works.


Watch the full discussion here: The FDE Playbook for AI Startups with Bob McGrew

《大模型精诚》两篇

世有愚者,读方三年,便谓天下无病可治;及治病三年,乃知天下无方可用。

— 【唐】孙思邈《大医精诚》

《大模型精诚(上)》仿照孙思邈的《大医精诚》而作,论述了术之源和道之始。《大模型精诚(下)》继承了上篇的旨趣,进一步阐明了工具的用途;批判了浮夸学术的弊端,强调了明人志向的正直;愿后来的学者能够谨慎守护精诚之道。

  1. 《大模型精诚·上:御术循道》
  2. 《大模型精诚·下:格物穷理》
  3. 《大医精诚》原文
中文摘要 & English Abstract

《大模型精诚》分为上下两篇,仿孙思邈《大医精诚》之文风,提出面对大语言模型(LLMs)之道,须持“精勤”“诚敬”之精神。文章指出,大模型非万能,初学者易为其智能所惑,唯有深入算法原理、洞察数据源头,方能破“表象之惑”,守“正道之用”。技术本无心,人心为舵,若妄施滥用,或将致祸;唯秉“精诚之志”,以智辅仁,方可济世安人。

“On the Sincerity and Mastery in Large Models” is a two-part essay inspired by Sun Simiao’s classical Chinese text On the Absolute Sincerity of Great Physicians. Written in classical Chinese style, it warns against superficial understanding and blind faith in large language models (LLMs). It calls for practitioners to uphold a spirit of diligence (“精”) and sincerity (“诚”)—to understand the inner principles of algorithms and the biases within data. The model is but a tool; its moral compass lies in the human operator. Only by combining technical rigor with ethical restraint can AI serve humanity and avoid causing harm. This is both a philosophical treatise on AI and a critique of today’s hasty tech culture.

《大模型精诚·上:御术循道》

昔者圣贤格物致知,究天人之际,通古今之变,成一家之言。今夫人工智能之兴,盖亦格物之一端也。自其出也,声誉日隆,众人或惊其智,或惧其势,而不知其理者众矣。或操之以利,或役之为器,然利器在手,而不察其锋芒,未必不自伤也。是故观其用者多,究其道者寡。

世有愚者,览模型三月,便谓天下无难可为;及历参数之调三载,方知世无定式可循。故智者必穷其理,探其源,精勤不倦,不得道听途说,或一知半解,便言大道已了,深自误哉!夫模型者,数据瀚海之所凝,千万参数之所成。 初也,对答如流,人以为智;久则偏识横生,虚妄自出。是以不明其本,只见其文,则如沙上筑塔,虽高而易倾;水中捞月,虽美而终空。

夫术者,行之器也;道者,心之正也。有道无术,术尚可求;有术无道,止于术。是故为学者,当怀精诚之心,以格物致知。不为文饰所惑,不为便捷所役。上穷算理之幽,下辨数据之源。善用其利,以济百工;慎防其弊,以安天下。惟敬惟谨,方能行稳致远;术必精诚,方可臻于大成。

噫!大道之行,贵在明理,重在行之。浮华易得,实知难求。若以华辞饰伪,虽一时风光,终非正道;惟有以道御术,以术辅仁,方能不为其所役,而反致其功。愿世之为学者,慎始慎终,毋躁毋惰;内求于诚,外谨于行,不仅以问答取巧,更以格物证理。则大模型之道,不特能济世用事,亦可以砥砺心志,通于大道矣!


《大模型精诚·下:格物穷理》

夫大模型者,非徒技艺之巧,实启万象之门,应世变之机也。机巧肇兴,数十年耕耘,方结斯器,蔚然成势。其志在洞察智能之源,其义在贯通语言之理。通天人,综古今,非术末之流,乃文明之维。治世之助,经纬之翼。然世多趋利者,见术而不求道;窥皮而不识骨。操几行机命,即称智械之师;观几番演示,便誉人类之镜。或视生成若巫术,妄言灵机觉醒,惑众而自昏,笑之可也。

昔圣贤格物致知,寒暑不辍,穷理尽性,尚不敢言道成;今浮学遇器之妙,应一问而百答,便谓智能已极,人力可替,岂不谬哉?夫言虽似人,意非其本;识未通理,情不存诚。模型者,镜也,影也,非圣贤之心也。其构也,采万卷之籍,汇千年之言,亿万试炼,始见一用。然中有偏识之患、虚妄之误、理断之病、语悖之失,不可不察。不明其理,妄施其用,是犹未诊而投毒,害人而不自知也。

故为学者,当守精诚。不为华饰所眩,不为捷径所诱。内怀谨惧之心,外行谨严之道。器不妄用,人不失本。若夫施用之道,尤宜慎审。毋以小试之验,妄推大用之功;毋因偶中之答,遽信全才之能。逢伦理之辩,涉利害之争,当辨是非之界,守正直之心。不可托器以避己责,盖模型无心,惟人有心。算巧犹器,操之在人;人失其正,器失其依,虽有神器,亦可为祸。是故大者不在术之精,而在人之诚与敬也。

噫!世风浮躁,器成于速,工毁于轻。市井谈智者,多竞捷而寡思;创企逐利者,多耀术而失本。若不慎其患,不固其本,大器之用,虽广亦危。惟精勤而博识,惟谨慎而明理;守德以立身,循道以济世。方可保其善用,免其深害。夫技艺者,舟也;德义者,舵也。舟疾无舵,必倾覆于风波。若能以人为本,以智辅仁,引势以用,慎力以行,使模型不妄言,使人心常自省。则虽新器日出,世亦不乱;虽智能骤起,人亦不亡。


《大医精诚》原文

中国【唐】孙思邈(581~682年)所著之《备急千金要方》第一卷,乃是中医学典籍中,论述医德的一篇极重要文献,为习医者所必读。

张湛曰:“夫经方之难精,由来尚矣。今病有内同而外异,亦有内异而外同,故五脏六腑之盈虚,血脉荣卫之通塞,固非耳目之所察,必先诊候以审之。而寸口关尺,有浮沉弦紧之乱;俞穴流注,有高下浅深之差;肌肤筋骨,有厚薄刚柔之异。唯用心精微者,始可与言于兹矣。今以至精至微之事,求之于至粗至浅之思,岂不殆哉!若盈而益之,虚而损之,通而彻之,塞而壅之,寒而冷之,热而温之,是重加其疾,而望其生,吾见其死矣。故医方卜筮,艺能之难精者也。既非神授,何以得其幽微?世有愚者,读方三年,便谓天下无病可治;及治病三年,乃知天下无方可用。故学者必须博极医源,精勤不倦,不得道听途说,而言医道已了,深自误哉!

凡大医治病,必当安神定志,无欲无求,先发大慈恻隐之心,誓愿普救含灵之苦。若有疾厄来求救者,不得问其贵贱贫富,长幼妍蚩,怨亲善友,华夷愚智,普同一等,皆如至亲之想。亦不得瞻前顾后,自虑吉凶,护惜身命,见彼苦恼,若己有之,深心凄怆,勿避险巇,昼夜寒暑,饥渴疲劳,一心赴救,无作工夫行迹之心,如此可做苍生大医,反之则是含灵巨贼。自古明贤治病,多用生命以济危急,虽曰贱畜贵人,至于爱命,人畜一也。损彼益己,物情同患,况于人乎?夫杀生求生,去生更远,吾今此方所以不用生命为药者,良由此也。其虻虫水蛭之属,市有先死者,则市而用之,不在此例。只如鸡卵一物,以其混沌未分,必有大段要急之处,不得已隐忍而用之,能不用者,斯为大哲,亦所不及也。其有患疮痍下痢,臭秽不可瞻视,人所恶见者,但发惭愧、凄怜、忧恤之意,不得起一念蒂芥之心,是吾之志也。

夫大医之体,欲得澄神内视,望之俨然,宽裕汪汪,不皎不昧,省病诊疾,至意深心,详察形候,纤毫勿失,处判针药,无得参差,虽曰病宜速救,要须临事不惑,唯当审谛覃思,不得于性命之上,率而自逞俊快,邀射名誉,甚不仁矣。又到病家,纵绮罗满目,勿左右顾盼;丝竹凑耳,无得似有所娱;珍羞迭荐,食如无味;醽醁兼陈,看有若无。所以尔者,夫一人向隅,满堂不乐,而况病人苦楚,不离斯须,而医者安然欢娱,傲然自得,兹乃人神之所共耻,至人之所不为,斯盖医之本意也。

夫为医之法,不得多语调笑,谈谑喧哗,道说是非,议论人物,炫耀声名,訾毁诸医,自矜己德,偶然治瘥一病,则昂头戴面,而有自许之貌,谓天下无双,此医人之膏肓也。老君曰:人行阳德,人自报之;人行阴德,鬼神报之;人行阳恶,人自报之,人行阴恶,鬼神害之。寻此贰途,阴阳报施,岂诬也哉?

所以医人不得恃己所长,专心经略财物,但作救苦之心,于冥运道中,自感多福者耳。又不得以彼富贵,处以珍贵之药,令彼难求,自眩功能,谅非忠恕之道。志存救济,故亦曲碎论之,学者不可耻言之鄙俚也!”


END

Zuckerberg’s Gamble: Risks and Rewards in AI Talent Acquisition


Mark Zuckerberg’s recent move to bring Alex Wang and his team into Meta represents a bold and strategic maneuver amid the rapid advancement of large models and AGI development. Putting aside the ethical considerations, Zuckerberg’s approach—laying off staff, then offering sky-high compensation packages with a 48-hour ultimatum to top AI scientists and engineers from OpenAI, alongside Meta’s acquisition of a 49% stake in Scale AI—appears to serve multiple objectives:

1. Undermining Competitors

By poaching key talent from rival companies, Meta not only weakens their R&D teams and disrupts their momentum but also puts pressure on Google, OpenAI, and others to reassess their partnerships with Scale AI. Meta’s investment may further marginalize these competitors by injecting uncertainty into their collaboration with Scale AI.

2. Reinvigorating the Internal Team

Bringing in fresh blood like Alex Wang’s team and top OpenAI talent could re-energize Meta’s existing research units. A successful “talent reset” may help the company gain a competitive edge in the race toward AGI.

3. Enhancing Brand Visibility

Even if the move doesn’t yield immediate results, it has already amplified Meta’s media presence, boosting its reputation as a leader in AI innovation.

From both a talent acquisition and PR standpoint, this appears to be a masterstroke for Meta.


However, the strategy is not without significant risks:

1. Internal Integration and Morale Challenges

The massive compensation packages offered to the new hires could trigger resentment among existing employees—especially in the wake of recent layoffs—due to perceived pay inequity. This may lower morale and even accelerate internal attrition. Cultural differences between the incoming and incumbent teams could further complicate integration and collaboration.

2. Return on Investment and Performance Pressure

Meta’s substantial investment in Alex Wang and Scale AI comes with high expectations for short-term deliverables. In a domain as uncertain as AGI, both the market and shareholders will be eager for breakthroughs. If Wang’s team fails to deliver measurable progress quickly, Meta could face mounting scrutiny and uncertainty over the ROI.

3. Impacts on Scale AI and the Broader Ecosystem

Alex Wang stepping away as CEO is undoubtedly a major loss for Scale AI, even if he retains a board seat. Leadership transitions and potential talent departures may follow. Moreover, Scale AI’s history of legal and compliance issues could reflect poorly on Meta’s brand—especially if public perception ties Meta to those concerns despite holding only non-voting shares. More broadly, Meta’s aggressive “poaching” approach may escalate the AI talent war, drive up industry-wide costs, and prompt renewed debate over ethics and hiring norms in the AI sector.


Conclusion
Meta’s latest move is undeniably ambitious. While it positions the company aggressively in the AGI race, it also carries notable risks in terms of internal dynamics, ROI pressure, and broader ecosystem disruption. Only time will tell whether this bold gamble pays off.

Is the AI PC a Gimmick or a Faster Carriage?

TL;DR: The post discusses the impact of AI on productivity, particularly through the emergence of AI PCs powered by localized edge AI. It highlights how large language models and the Core Ultra processor enable AI PCs to handle diverse tasks efficiently and securely. The article also touches on the practical applications and benefits of AI PCs in various fields. The comprehensive overview emphasizes the transformative potential of AI PCs and their pivotal role in shaping the future of computing.

Translation from the Source: AI PC 是噱头还是更快的马车?

Is AI a Bubble or a Marketing Gimmick?

Since 2023, everyone has known that AI is very hot, very powerful, and almost magical. It can generate articles with elegant language and write comprehensive reports, easily surpassing 80% or even more of human output. As for text-to-image generation, music composition, and even videos, there are often impressive results. There’s no need to elaborate on its hype…

For professions like designers and copywriters, generative AI has indeed helped them speed up the creative process, eliminating the need to start from scratch. Due to its high efficiency, some people in these positions might even face the worry of losing their jobs. But for ordinary people, aside from being a novelty, trendy AI tools from OpenAI or Stable Diffusion don’t seem to provide much practical help for their work. After all, most people don’t need to write well-structured articles or compose poems regularly. Moreover, after seeing enough AI output, they often feel it is mostly correct but useless information—helpful, but not very impactful.

So, when a phone manufacturer says it will no longer produce “traditional phones,” people scoff. When the concept of an AI PC emerges, it’s hard not to see it as a marketing gimmick. However, after walking around the exhibition area at Intel’s 2024 commercial client AI PC product launch, I found AI to be more useful than I imagined. Yes, useful—not needing to be breathtaking, but very useful.

The fundamental change in experience brought by localized edge AI

Since it is a commercial PC, it cannot be separated from the productivity tool attribute. If you don’t buy the latest hardware and can’t run the latest software versions, it’s easy to be labeled as having “low application skills.” Take Excel as an example. The early understanding of efficiency in Excel was using formulas for automatic calculations. Later, it was about macro code for automatic data filtering, sorting, exporting, etc., though this was quite difficult. A few years ago, learning Python seemed to be the trend, and without it, one was not considered competent in data processing. Nowadays, with data visualization being the buzzword, most Excel users have to search for tutorials online and learn on the spot for unfamiliar formulas. Complex operations often require repeated attempts.

So, can adding “AI” to a PC or installing an AI assistant make it trendy? After experiencing it firsthand, I can confirm that the AI PC is far from superficial. There is a company called ExtendOffice, specializing in Office plugins, which effectively solves the pain points of using Excel awkwardly: you just state your intention, and the AI assistant directly performs operations on the Excel sheet, such as currency conversion or encrypting a column of data. There’s no need to figure out which formula or function corresponds to your needs, no need to search for tutorials, and it skips the step-by-step learning process—the AI assistant handles it immediately.
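As a rough illustration of what such an assistant ends up doing once it has mapped an intent like “convert the USD prices in column B to EUR” onto spreadsheet operations, here is a minimal sketch. The use of openpyxl, the column layout, and the fixed exchange rate are illustrative assumptions, not ExtendOffice’s actual implementation.

# Minimal sketch: apply a parsed intent ("convert column B from USD to EUR")
# to a spreadsheet. openpyxl, the column letters, and the fixed rate are
# illustrative assumptions, not the ExtendOffice plugin's real code.
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws.append(["Item", "Price (USD)"])
for name, usd in [("Mouse", 25.0), ("Keyboard", 49.0), ("Monitor", 199.0)]:
    ws.append([name, usd])

USD_TO_EUR = 0.92  # placeholder rate; a real assistant would fetch or ask for it
ws.cell(row=1, column=3, value="Price (EUR)")
for row in range(2, ws.max_row + 1):
    usd = ws.cell(row=row, column=2).value
    ws.cell(row=row, column=3, value=round(usd * USD_TO_EUR, 2))

wb.save("quotes_converted.xlsx")  # the user only stated the intent; this is the effect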

This highlights a particularly critical selling point of the AI PC: localization, and based on that, it can be embedded into workflows and directly participate in processing. We Chinese particularly love learning, always saying “teaching someone to fish is better than giving them a fish,” but the learning curve for fishing is too long. In an AI PC, you can get both the fish and the fishing skills because the fisherman (AI assistant) is always in front of you, not to mention it can also act as a chef or secretary.

Moreover, the “embedding” mentioned earlier is not limited to a specific operation (like adding a column of data or a formula to Excel). It can generate multi-step, cross-software operations. This demonstrates the advantage of large language models: they can accept longer inputs, understand, and break them down. For example, we can tell the AI PC: “Mute the computer, then open the last read document and send it to a certain email.” Notably, as per the current demonstration, there is no need to specify the exact document name; vague instructions are understandable.

Another operation that pleasantly surprised me was batch renaming files. In Windows, batch renaming files requires some small techniques and can only change them into regular names (numbers, letter suffixes, etc.). But with the help of an AI assistant, we can make file names more personalized: adding relevant customer names, different styles, etc. This seemingly simple task actually involves looking at each file, extracting key information, and even describing some abstract information based on self-understanding, then individually writing new file names—a very tedious process that becomes time-consuming with many files. With the AI assistant, it’s just a matter of saying a sentence.

Understanding longer contexts, multi-modal inputs, etc., all rely on the capabilities of large language models, but this is running locally, not relying on cloud inference. Honestly, no one would think that organizing file names in the local file system requires going to the cloud, right? The hidden breaks between the edge and the cloud indeed limit our imagination, so these local operations of the AI PC really opened my mind.
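Below is a minimal sketch of what such batch renaming boils down to. The suggest_name() helper is a placeholder for the assistant’s local model call (here it simply uses the first line of each file so the script runs on its own), and the ./notes folder is an assumed location.

# Sketch of LLM-assisted batch renaming. suggest_name() is a placeholder for
# the AI assistant's local model call; here it just uses the first line of the
# file so the script runs on its own.
from pathlib import Path

def suggest_name(text: str) -> str:
    # Placeholder: a real AI PC assistant would ask the local LLM for a short,
    # descriptive title (e.g. including the customer name it spots in the text).
    first_line = text.strip().splitlines()[0] if text.strip() else "untitled"
    return "_".join(first_line.lower().split())[:40]

folder = Path("./notes")  # assumed folder of plain-text files to rename
for path in folder.glob("*.txt"):
    content = path.read_text(errors="ignore")
    new_stem = suggest_name(content)
    path.rename(path.with_name(f"{new_stem}{path.suffix}"))
    print(path.name, "->", f"{new_stem}{path.suffix}")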

Compared to the early familiar cloud-based AI tools, localization brings many obvious benefits. For instance, even when offline, natural language processing and other operations can be completed. For those early users who heavily relied on large models and encountered service failures, “the sky is falling” was a pain point. Not to mention scenarios without internet, like on a plane, maintaining continuous availability is a basic need.

Local deployment can also address data security issues. Since the rise of large models, there have been frequent news of companies accidentally leaking data. Using ChatGPT for presentations, code reviews, etc., is great, but it requires uploading documents to the cloud. This has led many companies to outright ban employees from using ChatGPT. Subsequently, many companies chose to train and fine-tune private large models using open-source models and internal data, deploying them on their own servers or cloud hosts. Furthermore, we now see that a large model with 20 billion parameters can be deployed on an AI PC based on the Core Ultra processor.
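As a rough sketch of what “local deployment” looks like in practice, the snippet below loads a locally stored model and answers a query entirely on the machine. The llama-cpp-python route, the GGUF file name, and the generation settings are assumptions; Intel-optimized stacks such as OpenVINO offer equivalent paths.

# Rough sketch of fully local inference with llama-cpp-python; the GGUF path
# and generation settings are assumptions, not a specific vendor setup.
from llama_cpp import Llama

llm = Llama(model_path="models/local-20b-q4.gguf",  # any locally stored GGUF model
            n_ctx=4096)                              # context window kept on-device

resp = llm("Summarize the key risks in the attached supplier contract:\n...",
           max_tokens=256, temperature=0.2)
print(resp["choices"][0]["text"])  # nothing left the machine: no upload, works offline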

These large models deployed on AI PCs have already been applied in various vertical fields such as education, law, and medicine, generating knowledge graphs, contracts, legal opinions, and more. For example, inputting a case into ThunderSoft’s Cube intelligent legal assistant can analyze the case, find relevant legal provisions, draft legal documents, etc. In this scenario, the privacy of the case should be absolutely guaranteed, and lawyers wouldn’t dare transmit such documents to the cloud for processing. Doctors have similar constraints. For research based on medical cases and genetic data, conducting genetic target and pharmacological analyses on a PC eliminates the need to purchase servers or deploy private clouds.

Incidentally, the large model on the AI PC also makes training simpler than imagined. Feeding the local files visible to you into the AI assistant can solve the problem of “correct nonsense” that previous chatbots often produced. For example, generating a quote email template with AI is easy, but it’s normal for a robot to not understand key information like prices, which requires human refinement. If a person handles this, preparing a price list in advance is a reasonable requirement, right? Price lists and FAQs need to be summarized and refined, then used to train newcomers more effectively—that’s the traditional view. Local AI makes this simple: let it read the Outlook mailbox, and it will learn the corresponding quotes from historical emails. The generated emails won’t just be template-level but will be complete with key elements. Our job will be to confirm whether the AI’s output is correct. And these learning outcomes can be inherited.
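Here is a toy sketch of the retrieval step behind “let it read the mailbox”: find the past quote emails most relevant to a new request and place them in the local model’s context (a simple RAG pattern). TF-IDF stands in for whatever index the real assistant uses, and the emails are invented placeholders.

# Toy retrieval step: find past quote emails relevant to a new request and
# stuff them into the local model's prompt. TF-IDF stands in for whatever
# index a real assistant uses; the emails are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

history = [
    "Quote for 100 units of model A sensors: 12.50 USD per unit, 2-week lead time.",
    "Re: bracket order. 500 brackets at 0.80 USD each, shipping included.",
    "Model A sensor pricing update: 11.90 USD per unit above 500 units.",
]
request = "Customer asks for a quote on 300 model A sensors."

vec = TfidfVectorizer().fit(history + [request])
scores = cosine_similarity(vec.transform([request]), vec.transform(history))[0]
top = sorted(range(len(history)), key=lambda i: scores[i], reverse=True)[:2]

context = "\n".join(history[i] for i in top)
prompt = f"Past emails:\n{context}\n\nDraft a reply to: {request}"
print(prompt)  # this prompt would go to the local LLM, which fills in real prices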

Three Major AI Engines Support Local Large Models

In the information age, we have experienced several major technological transformations. First was the popularization of personal computers, then the internet, and then mobile internet. Now we are facing the empowerment and even restructuring of productivity by AI. The AI we discuss today is not large-scale clusters for training or inference in data centers but the PCs at our fingertips. AIGC, video production, and other applications for content creators have already continuously amazed the public. Now we further see that AI PCs can truly enhance the work efficiency of ordinary office workers: handling trivial tasks, making presentations, writing emails, finding legal provisions, etc., and seamlessly filling in some of our skill gaps, such as using unfamiliar Excel functions, creating supposedly sophisticated knowledge graphs, and so on. All this relies not only on the “intelligent emergence” of large language models but also on sufficiently powerful performance to support local deployment.

We frequently mention the “local deployment” of large models, which relies on strong AI computing power at the edge. The so-called AI PC relies on the powerful CPU+GPU+NPU triad of AI engines in the Core Ultra processor, whose computing power is sufficient to support the local operation of a large language model with 20 billion parameters. By comparison, AIGC applications such as text-to-image generation are a piece of cake.

Fast CPU Response: The CPU can be used to run traditional, diverse workloads and achieve low latency. The Core Ultra adopts advanced Intel 4 manufacturing process, allowing laptops to have up to 16 cores and 22 threads, with a turbo frequency of up to 5.1GHz.

High GPU Throughput: The GPU is ideal for large workloads that require parallel throughput. The Core Ultra comes standard with Arc GPU integrated graphics. The Core Ultra 7 165H includes 8 Xe-LPG cores (128 vector engines), and the Core Ultra 5 125H includes 7. Moreover, this generation of integrated graphics supports AV1 hardware encoding, enabling faster output of high-quality, high-compression-rate videos. With its leading encoding and decoding capabilities, the Arc GPU has indeed built a good reputation in the video editing industry. With a substantial increase in vector engine capabilities, many content creation ISVs have demonstrated higher efficiency in smart keying, frame interpolation, and other functions based on AI PCs.

Efficient NPU: The NPU (Neural Processing Unit), newly introduced in the Core Ultra, handles sustained, frequently used AI workloads at low power to ensure high energy efficiency. For example, Huorong demonstrated offloading virus scanning, previously handled by the CPU and GPU, to the NPU: slightly slower than using the GPU, but with a clear power advantage, which suits background security tasks well. Familiar video-conferencing features such as beautification, background replacement, and auto-framing can also be handed to the NPU. The NPU is even capable of running a lightweight language model such as TinyLlama 1.1B on its own, enough for continuous workloads like chatbots, smart assistants, and intelligent operations, while leaving CPU and GPU resources for other tasks.
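As a rough sketch of how software chooses among these three engines, OpenVINO exposes them as separate devices, and the same model can be compiled for whichever one suits the workload; the model file name below is an assumption.

# Sketch: the CPU, GPU and NPU show up as separate OpenVINO devices, and the
# same network can be compiled for whichever engine suits the workload.
# Requires the openvino package (2023+); the model file name is an assumption.
import openvino as ov

core = ov.Core()
print(core.available_devices)          # e.g. ['CPU', 'GPU', 'NPU'] on a Core Ultra laptop

model = core.read_model("model.xml")   # any IR/ONNX model exported beforehand
compiled = core.compile_model(model, "NPU")   # or "GPU" / "CPU" / "AUTO"
# inference on `compiled` now runs on the chosen engine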

Edge AI has unlimited possibilities, and its greatest value is precisely in practicality. With sufficient computing power, whether through large-scale language models or other models, it can indeed increase the efficiency of content production and indirectly enhance the operational efficiency of every office worker.

For commercial AI PCs, Intel has also launched the vPro® platform based on Intel® Core™ Ultra, which organically combines AI with the productivity, security, manageability, and stability of the commercial platform. Broadcom demonstrated that vPro-based AI PC intelligent management transforms traditional asset management from passive to proactive: previously, it was only possible to see whether devices were “still there” and “usable,” and operations like patch upgrades were planned; with AI-enhanced vPro, it can autonomously analyze device operation, identify potential issues, automatically match corresponding patch packages, and push suggestions to maintenance personnel. Beirui’s Sunflower has an AI intelligent remote control report solution, where remote monitoring of PCs is no longer just screen recording and capturing but can automatically and in real-time identify and generate remote work records of the computer, including marking sensitive operations such as file deletion and entering specific commands. This significantly reduces the workload of maintenance personnel in checking and tracing records.

The Future is Here: Hundreds of ISVs Realizing Actual Business Applications

Henry Ford once commented on the invention of the automobile: “If you ask your customers what they need, they will say they need a faster horse.”

“A faster horse” is a consumer trap. People who think AI phones and AI PCs are just gimmicks might temporarily not see the need to upgrade their horse based on convention. More deeply, the public has some misunderstandings about the implementation of AI, which manifests in two extremes: one extreme thinks it’s something for avant-garde heavy users and flagship configurations, typically in scenarios like image and video processing; the other extreme sees it as refreshing chatbots, like an enhanced search engine, useful but not necessary. In reality, the implementation of AI PCs far exceeds the imagination of many people: for commercial customers, Intel has deeply optimized cooperation with more than 100 ISVs worldwide, and over 35 local ISVs have optimized integration at the terminal, creating a huge AI ecosystem with over 300 ISV features, bringing an unprecedented AI PC experience!

Moreover, I do not think this scale of AI application realization is pie in the sky or a bet on some distant future. Because in my eyes, the display of numerous AI PC solutions is like an “OpenVINO™ party.” OpenVINO™ is a cross-platform deep learning toolkit developed by Intel, meaning “Open Visual Inference and Neural Network Optimization.” This toolkit was actually released in 2018, and over the years it has accumulated a large number of computer vision and deep learning inference applications. By the time of the Iris Xe integrated graphics era, the software and hardware combination already had a strong reputation. For example, relying on a mature algorithm store, various AI applications can be easily built on the 11th generation Core platform, from behavior detection for smart security to automatic inventory checking in stores, with quite good results. Now, as the AI PC’s integrated graphics evolve to Xe-LPG with doubled computing power, the applications accumulated around OpenVINO™ will perform even better. In other words, the “right place” (an Xe engine with direct continuity) and the “right people” (OpenVINO™’s ISV resources) were already in place.

What truly ignites the AI PC is the “right time”: large language models have become practical. The breakthrough of large language models has effectively solved the problems of natural language interaction and data training, greatly lowering the threshold for ordinary users to utilize AI computing power. Earlier, I cited many examples embedded in office applications. Here, I can give another example: the combination of Kodong Intelligent Controller’s multimodal visual language model with a robotic arm. The robotic arm is a common robot application, which has long been able to perform various operations with machine vision, such as moving and sorting objects. However, traditionally, object recognition and operation require pre-training and programming. With the integration of large language models, the whole system can perform multimodal instruction recognition and execution. For instance, we can say: “Put the phone on that piece of paper.” In this scenario, we no longer need to teach the robot what a phone is or what paper is, do not need to give specific coordinates, and do not need to plan the moving path. Natural language instructions and camera images are well integrated, and execution instructions for the robotic arm are generated automatically. For such industrial scenarios, the entire process can be completed on a laptop-level computing platform, and the data does not need to leave the factory.

Therefore, what AI PC brings us is definitely not just “a faster horse,” but it subverts the way PCs are used and expands the boundaries of user capabilities. Summarizing the existing ISVs and solutions, we can categorize AI PC applications into six major scenarios:

  1. AI Chatbot: More professional Q&A for specific industries and fields.
  2. AI PC Assistant: Directly operates the PC, handling personal files, photos, videos, etc.
  3. AI Office Assistant: Office plugins to enhance office software usage efficiency.
  4. AI Local Knowledge Base: RAG (Retrieval Augmented Generation) applications, including various text and video files.
  5. AI Image and Video Processing: Generation and post-processing of multimedia information such as images, videos, and audio.
  6. AI PC Management: More intelligent and efficient device asset and security management.

Summary

It is undeniable that the development of AI always relies on innovation in, and the combination of, hardware and software. AI PCs based on the Core Ultra are, first and foremost, faster and more powerful PCs with lower power consumption and longer battery life. These hardware traits support AI applications that change our usage experience and habits more profoundly. PCs endowed with “intelligent emergence” are no longer just productivity tools; in some scenarios they can directly become collaborators or even operators. Behind this are the performance gains from microarchitecture and process-node advances, as well as the empowerment of new productive forces such as large language models.

If we regard CPU, GPU, and NPU as the three major computing powers of AI PCs, correspondingly, the value of AI PCs for localizing AI (on the client side) can be summarized into three major rules: economy, physics, and data confidentiality. The so-called economy means that processing data locally can reduce cloud service costs and optimize economic efficiency; physics corresponds to the “virtual” nature of cloud resources, where local AI services can provide better timeliness, higher accuracy, and avoid transmission bottlenecks between the cloud and the client; data confidentiality means that user data stays completely local, preventing misuse and leakage.

In 2023, the rapid advancement of large language models achieved the AI era in the cloud. In 2024, the client-side implementation of large language models ushered in the AI PC era. We also look forward to AI continuously solidifying applications in the intertwined development of the cloud and the client, continuously releasing powerful productivity; and we look forward to Intel jointly advancing with ISV+OEM in the future to provide us with even stronger “new productivity.”


AI PC 是噱头还是更快的马车?

AI 是虚火还是营销噱头?

2023 年以来,所有人都知道 AI 非常的热、非常的牛、非常的神,生成的文章辞藻华丽、写的报告面面俱到,毫不谦虚地说,打败 80% 甚至更多的人类。至于文生图、作曲,甚至是视频,都常有令人惊艳的作品。吹爆再吹爆,无需赘述……

对于设计师、文案策划等职业,生成式 AI 确实已经帮助他们提高了迸发创意的速度,至少不必万丈高楼平地起了。由于效率太高,这些岗位中的部分人可能反而要面对失业的烦恼。但对于普通人,AI 除了猎奇,OpenAI、SD 等时髦玩意儿好像对工作也没啥实质性的帮助——毕竟平时不需要写什么四平八稳的文章,更不需要吟诗作赋,而且见多了 AI 的输出,也实在觉得多是些正确的废话,有用,但也没啥大用。

所以,当某手机厂商说以后不生产“传统手机”的时候,大家嗤之以鼻。当 AI PC 概念出现的时候,也难免觉得是营销噱头。但是,当我在 2024 英特尔商用客户端 AI PC 产品发布会的展区走了一圈之后,我发现 AI 比我想象中的更有用。是的,有用,不需要技惊四座,但,很有用。

端侧 AI 的本地化落地带来根本性的体验变化

既然是商用 PC,那就离不开生产力工具属性。如果不买最新的硬件,玩不转最新的软件版本,很容易在鄙视链中打上“应用水平低下”的标签。就拿 Excel 为例吧,最早接触 Excel 的时候,对效率的理解是会用公式,自动进行一些计算等。再然后,是宏代码,自动执行数据的筛选、排序、导出等等,但这个难度还是比较大的。前几年呢,又似乎流行起了 Python,不去学一下那都不配谈数据处理了。在言必称数据可视化的当下,多数 Excel 用户的真实情况是尝试陌生的公式都需要临时百度一下教程,现学现用,稍复杂的操作可能要屡败屡试。

那 PC 前面加上 “AI”,或者装上某个 AI 助理,就可以赶时髦了吗?我实际体验之后,确定 AI PC 绝非如此浅薄。在 AI PC 上,有个专门做 Office 插件的公司叫 ExtendOffice,就很好地解决了 Excel 用起来磕磕绊绊的痛点:你只要说出你的意图,AI 助手马上直接在 Excel 表格上进行操作,譬如币值转换,甚至加密某一列数据。不需要去琢磨脑海里的需求到底需要对应哪个公式或者功能才可以实现,不用去查找教程,也跳过了 step by step 的学习,AI 助手当场就处理完了。

这就体现了 AI PC 一个特别关键的卖点:本地化,且在此基础上,可以嵌入工作流程,直接参与处理。我们中国人特别热爱学习,总说“授人以鱼不如授人以渔”,但“渔”的学习曲线太长了。在 AI PC 里,鱼和渔可以同时获得,因为渔夫(AI 助手)随时都在你眼前,更不要说它还可以当厨师、当秘书。

而且,刚才说的“嵌入”并不局限于某一个操作环节(类似于刚才说的给 Excel 增加某一列数据、公式),而是可以生成一个多步骤的、跨软件的操作。这也体现了大语言模型的优势:可以接受较长的输入并理解、分拆。譬如,我们完全可以对 AI PC 说:帮我将电脑静音,然后打开上次阅读的文档,并把它发送给某某邮箱。需要强调的是,以目前的演示,不需要指定准确的文档名,模糊的指示是可以理解的。还有一个让我暗暗叫好的操作是批量修改文件名。在 Windows 下批量修改文件名是需要一些小技巧的,而且,只能改成有规律的文件名(数字、字母后缀)等,但在 AI 助手的帮助下,我们可以让文件名更有个性:分别加上相关客户的名字、不同的风格类型等等。这事说起来简单,但其实需要挨个查看文件、提取关键信息,甚至根据自我理解去描述一些抽象的信息,然后挨个编写新的文件名——过程非常琐碎,文件多了就很费时间,但有了 AI 助手,这就是一句话的事。理解较长的上下文、多模态输入等等,这些都必须依赖大语言模型的能力,但其实是在本地运行的,而非借助云端的推理能力。讲真,应该没有人会认为整理文件名这种本地文件系统的操作还需要去云端绕一圈吧?从端到云之间隐藏的各种断点确实限制了我们的想象力,因此,AI PC 的这些本地操作真的打开了我的思路。

相对于大家早期较为熟悉的基于云端的 AI 工具,本地化还带来了很多显而易见的好处。譬如,断网的情况下,也是可以完成自然语言的处理和其他的操作。这对于那些曾经重度依赖大模型能力,且遭遇过服务故障的早期大模型用户而言,“天塌了”就是痛点。更不要说坐飞机之类的无网络场景了,保持连续的可用性是一个很朴素的需求。

本地部署还可以解决数据安全问题。大模型爆火之初就屡屡传出某某公司不慎泄露数据的新闻。没办法,用 ChatGPT 做简报、检查代码等等确实很香啊,但前提是得把文档上传到云端。这就导致许多企业一刀切禁止员工使用 ChatGPT。后来的事情就是许多企业选择利用开源大模型和内部数据训练、微调私有的大模型,并部署在自有的服务器或云主机上。更进一步的,现在我们看到规模 200 亿参数的大模型可以部署在基于酷睿 Ultra 处理器的 AI PC 上。

这种部署在 AI PC 上的大模型已经涉及教育、法律、医学等多个垂直领域,可以生成包括知识图谱、合同、法律意见等。譬如,将案情输入中科创达的魔方智能法务助手,就可以进行案情分析,查找相关的法律条文,撰写法律文书等。在这个场景中,很显然案情的隐私是应该绝对保证的,律师不敢将这种文档传输到云端处理。医生也有类似的约束,基于病例、基因数据等进行课题研究,如果能够在 PC 上做基因靶点、药理分析等,就不必采购服务器或者部署私有云了。

顺便一提的是,AI PC 上的大模型还让训练变得比想象中要简单,把本地你能看到的文件“喂”给 AI 助理之类的就可以了。这就解决了以往聊天机器人那种活只干了一半的“正确的废话”。譬如,通过 AI 生成一个报价邮件模板是很轻松的,但是,一般来说价格这种关键信息,机器人不懂那是很正常的事情,所以需要人工进行完善。如果找一个人类来处理这种事情,那提前做一份价格表是合理要求吧?报价表、FAQ 等都是属于需要总结提炼的工作,然后才能更有效率地培训新人——这是传统观念。本地的 AI 可以让这个事情变得很简单:让它去读 Outlook 邮箱就好了,片刻之后它自己就从历史邮件中“学”到对应的报价。相应生成的邮件就不仅是模版级了,而是要素完善的,留给我们做的就只剩确认 AI 给的结果是否正确。而且这种学习成果是可以继承下来的。

三大 AI 引擎撑起本地大模型

信息时代,我们已经经历了几次重大的科技变革。首先是个人电脑的普及,然后是互联网的普及,再就是移动互联网。现在我们正在面对的是 AI 对生产力的赋能甚至重构。我们今天讲的 AI 不是在数据中心里做训练或者推理的大规模集群,而是手边的 PC。AIGC、视频制作等面向内容创作者的应用已经不断给予大众诸多震撼了。现在我们进一步看到的是 AI PC 已经可以实实在在的提升普通白领的工作效率:处理琐碎事务,做简报、写邮件、查找法条等等,并且无缝衔接式地补齐我们的一些技能短板,类似于应用我们原本并不熟悉的的 Excel 功能、制作原以为高大上的知识图谱,诸如此类。这一切当然不仅仅依赖于大语言模型的“智能涌现”,也需要足够强大的性能以支撑本地部署。

我们多次提到的大模型的“本地部署”,都离不开端侧强劲的 AI 算力。所谓的 AI PC,依靠的是酷睿 Ultra 处理器强悍的 CPU+GPU+NPU 三大 AI 引擎,其算力足够支持 200 亿参数的大语言模型在本地运行推理过程,至于插图级的文生图为代表的 AIGC 应用相对而言倒是小菜一碟了。
 

  • CPU 快速响应:CPU 可以用来运行传统的、多样化的工作负载,并实现低延迟。酷睿 Ultra 采用先进的 Intel 4 制造工艺,可以让笔记本电脑拥有多达 16 个核心 22 个线程,睿频可高达 5.1GHz。
     
  • GPU 高吞吐量:GPU 非常适合需要并行吞吐量的大型工作负载。酷睿 Ultra 标配 Arc GPU 核显,酷睿 Ultra 7 165H 包含 8 个 Xe-LPG 核心(128 个矢量引擎),酷睿 Ultra5 125H 包含 7 个。而且,这一代核显还支持 AV1 硬编码,可以更快速地输出高质量、高压缩率的视频。凭借领先的编解码能力,Arc GPU 确实在视频剪辑行业积累的良好的口碑。随着矢量引擎能力的大幅度提升,大量内容创作 ISV 的演示了基于 AI PC 的更高效率的智能抠像、插帧等功能。
     
  • NPU 优异能效:酷睿 Ultra 处理器全新引入的 NPU(神经处理单元)能够以低功耗处理持续存在、频繁使用的 AI 工作负载,以确保高能效。譬如,火绒演示了利用 NPU 算力接管以往由 CPU 和 GPU 承担的病毒扫描等工作,虽然速度较调用 GPU 略低,但能耗有明显的优势,特别适合安全这种后台操作。我们已经很熟悉的视频会议中常用的美颜、背景更换、自动居中等操作,也可以交给 NPU 运行。NPU 也完全有能力仅凭一己之力运行轻量级的大语言模型,例如 TinyLlama 1.1,足以满足聊天机器人、智能助手、智能运维等连续性的业务需求,而将 CPU 和 GPU 的资源留给其他业务。
     

针对商用 AI PC,英特尔还推出了基于英特尔® 酷睿™ Ultra 的 vPro® 平台,将 AI 和商用平台的生产力、安全性、可管理性和稳定性有机结合。博通展示的基于 vPro 的 AI PC 智能化管理将传统的资产管理从被动变为主动:以往只能看到设备是否“还在”、“能用”,补丁升级等操作也是计划内的;而 AI 加持的 vPro 可以自主分析设备的运行,从中发现隐患并自动匹配相应的补丁包、向运维人员推送建议等。贝锐向日葵有一个AI智能远控报告方案,对 PC 的远程监控不再仅仅是录屏、截屏,而是可以自动、实时地识别和生成电脑的远程工作记录,包括标记一些敏感操作,如删除文件、输入特定的指令等。这也明显减轻了运维人员检查、回溯记录的工作量。

未来已来:数以百计的 ISV 实际业务落地

亨利福特曾经这样评价汽车的发明:“如果你问你的顾客需要什么,他们会说需要一辆更快的马车。”

“更快的马车”是一种消费陷阱,认为 AI 手机、AI PC 只是噱头的人们可能只是基于惯例认为自己暂时不需要更新马车。更深层次的,是大众对 AI 的落地有一些误解,表现为两种极端:一种极端是认为那是新潮前卫的重度用户、旗舰配置的事情,典型的场景是图像视频处理等;另一种极端是觉得是耳目一新的聊天机器人,类似于强化版的搜索引擎,有更好,无亦可。但实际上,AI PC 的落地情况远超许多人的想象:对于商用客户而言,英特尔与全球超过 100+ 个 ISV 深度优化合作,本土 35+ISV 在终端优化融合,创建包含 300 多项 ISV 特性的庞大 AI 生态系统,带来规模空前的 AI PC 体验!

而且,我并不认为这个数量级的 AI 应用落地是画饼或者“战未来”。因为在我眼里,诸多 AI PC 解决方案的展示,宛如 “OpenVINO™ 联欢会”。OpenVINO™ 是英特尔开发的跨平台深度学习工具包,意即“开放式视觉推理和神经网络优化”。这个工具包其实在 2018 年就已经发布,数年来已经积累了大量计算视觉和深度学习推理应用,发展到 Iris Xe 核显时期,软件、硬件的配合就已经很有江湖地位了。譬如依托成熟的算法商店,基于 11 代酷睿平台可以很轻松的构建各式各样的 AI 应用,从智慧安防的行为检测,到店铺自动盘点,效果相当的好。现在,AI PC 的核显进化到 Xe-LPG,算力倍增,OpenVINO™ 积累的各式应用本身就会有更好的表现,可以说“地利”(具有延续性的 Xe 引擎)和“人和”(OpenVINO™ 的 ISV 资源)早就是现成的。

真正引爆 AI PC 的是“天时”,也就是大语言模型步入实用化。大语言模型的突破很好地解决了自然语言交互和数据训练的问题,极大地降低了普通用户利用 AI 算力的门槛。前面我举了很多嵌入办公应用的例子,在这里,我可以再举一个例子:科东智能控制器的多模态视觉语言模型与机械臂的结合。机械臂是司空见惯的机器人应用,早就可以结合机器视觉做各种操作,移动、分拣物品等等。但物品的识别和操作,传统上是是需要预训练和编程的。结合大语言模型后,整套系统就可以做多模态的指令识别与执行了,譬如我们可以说:把手机放到那张纸上面。在这个场景中,我们不再需要教会机器人手机是什么、纸是什么,不需要给具体的坐标,不需要规划移动的路径。自然语言的指令,摄像头的图像,这些多模态的输入被很好地融合,并自行生成了执行指令给机械臂。对于这样的工业场景,整套流程可以在一台笔记本电脑等级的算力平台上完成,数据不需要出厂。

所以,AI PC 给我们带来的,绝对不仅仅是“更快的马车”,而是颠覆了 PC 的使用模式,拓展了用户的能力边界。盘点已有的 ISV 与解决方案,我们可以将 AI PC 的应用总结为六大场景:
 

  • Al Chatbot:针对特定行业和领域更加专业的问答。
     
  • AI PC 助理:直接对 PC 操作,处理个人文件、照片、视频等。
     
  • Al Office 助手:Office 插件,提升办公软件使用效率。
     
  • AI 本地知识库:RAG(Retrieval Augmented Generation,检索增强生成)应用,包括各类文本和视频文件。
     
  • AI 图像视频处理:图像、视频、音频等多媒体信息的生成与后期处理。
     
  • AI PC 管理:更加智能高效的设备资产及安全管理。

小结

不可否认,AI 的发展永远离不开硬件与软件的技术创新、相互结合,基于酷睿 Ultra 的 AI PC 首先是更快、更强、更低功耗、更长待机的 PC,这些硬件特性支撑的 AI 应用对我们的使用体验、使用模式带来了更深刻的改变。获得“智能涌现”加持的 PC 不再仅仅是生产力工具,在某些场景中,它直接可以化身协作者甚至操作者。这背后既有微架构和生产工艺提升带来的性能改进,也有大语言模型等新质生产力的赋能。

如果我们将 CPU、GPU、NPU 视作是 AI PC 的三大算力,相应的,也可以将 AI PC 让 AI 本地化(端侧)落地的价值归纳为三大法则:经济、物理、数据保密。所谓经济,是数据在本地处理可降低云服务成本,优化经济性;物理则对应云资源的“虚”,本地 AI 服务可以提供更好的及时性,更高的准确性,避免了云与端之间的传输瓶颈;数据保密,是指用户数据完全留在本地,防止滥用和泄露。

在 2023 年,大语言模型的狂飙成就了云端的 AI 元年。2024 年,大语言模型的端侧落地开启了 AI PC 元年。我们也期待 AI 在云与端的交织发展当中不断夯实应用,源源不绝地释放强大生产力;更期待英特尔未来联合 ISV+OEM 共同发力,为我们提供更加强劲的“新质生产力”。

AI Revolutionizes Industry and Retail: From Production Lines to Personalized Shopping Experiences

  1. Industry and Retail Relationship
  2. AI in Industry
  3. AI in Retail
  4. Summary

AI technology is increasingly being utilized in the industry and retail sectors to enhance efficiency, productivity, and customer experiences. In this post, we first revisit the relationship between the industry and retail sectors, then survey some common AI technologies and applications used in these domains.

Industry and Retail Relationship

The key difference between industry and retail lies in their primary functions and the nature of their operations:

Industry:

  • Industry, often referred to as manufacturing or production, involves the creation, extraction, or processing of raw materials and the transformation of these materials into finished goods or products.
  • Industrial businesses are typically involved in activities like manufacturing, mining, construction, or agriculture.
  • The primary focus of the industry is to produce goods on a large scale, which are then sold to other businesses, wholesalers, or retailers. These goods are often used as inputs for other industries or for further processing.
  • Industries may have complex production processes, rely on machinery and technology, and require substantial capital investment.

Retail:

  • Retail, on the other hand, involves the sale of finished products or goods directly to the end consumers for personal use. Retailers act as intermediaries between manufacturers or wholesalers and the end customers.
  • Retailers can take various forms, including physical stores, e-commerce websites, supermarkets, boutiques, and more.
  • Retailers may carry a wide range of products, including those manufactured by various industries. They focus on providing a convenient and accessible point of purchase for consumers.
  • Retail operations are primarily concerned with merchandising, marketing, customer service, inventory management, and creating a satisfying shopping experience for consumers.

AI in Industry

AI, or artificial intelligence, is revolutionizing industry sectors by powering various applications and technologies that enhance efficiency, productivity, and customer experiences. Here are some common AI technologies and applications used in these domains:

1. Robotics and Automation: AI-driven robots and automation systems are used in manufacturing to perform repetitive, high-precision tasks, such as assembly, welding, and quality control. Machine learning algorithms enable these robots to adapt and improve their performance over time.

2. Predictive Maintenance: AI is used to predict when industrial equipment, such as machinery or vehicles, is likely to fail. This allows companies to schedule maintenance proactively, reducing downtime and maintenance costs.

3. Quality Control: Computer vision and machine learning algorithms are employed for quality control processes. They can quickly identify defects or irregularities in products, reducing the number of faulty items reaching the market.

4. Supply Chain Optimization: AI helps in optimizing the supply chain by predicting demand, managing inventory, and optimizing routes for logistics and transportation.

5. Process Optimization: AI can optimize manufacturing processes by adjusting parameters in real time to increase efficiency and reduce energy consumption.

6. Safety and Compliance: AI-driven systems can monitor and enhance workplace safety, ensuring that industrial facilities comply with regulations and safety standards.


AI in Retail

AI technology is revolutionizing the retail sector too, introducing innovative solutions and transforming the way businesses engage with customers. Here are some key AI technologies and applications used in retail:

1. Personalized Marketing: AI is used to analyze customer data and behaviours to provide personalized product recommendations, targeted marketing campaigns, and customized shopping experiences.

2. Chatbots and Virtual Assistants: Retailers employ AI-powered chatbots and virtual assistants to provide customer support, answer queries, and assist with online shopping.

3. Inventory Management: AI can optimize inventory levels and replenishment by analyzing sales data and demand patterns, reducing stockouts and overstock situations.

4. Price Optimization: Retailers use AI to dynamically adjust prices based on various factors, such as demand, competition, and customer behaviour, to maximize revenue and profits.

5. Visual Search and Image Recognition: AI enables visual search in e-commerce, allowing customers to find products by uploading images or using images they find online.

6. Supply Chain and Logistics: AI helps optimize supply chain operations, route planning, and warehouse management, improving efficiency and reducing costs.

7. In-Store Analytics: AI-powered systems can analyze in-store customer behaviour, enabling retailers to improve store layouts, planogram designs, and customer engagement strategies.

8. Fraud Detection: AI is used to detect and prevent fraudulent activities, such as credit card fraud and return fraud, to protect both retailers and customers.

Summary

AI’s potential to transform industry and retail is huge and its future applications are very promising. As AI technologies advance, we can expect increased levels of automation, personalization, and optimization in industry and retail operations.

AI technologies in these sectors often rely on machine learning (ML), deep learning (DL), natural language processing (NLP), and computer vision (CV), and now Generative Large Language Models (LLM) to analyze and gain insights from data. These AI applications are continuously evolving and are changing the way businesses in these sectors operate, leading to improved processes and customer experiences.

AI will drive high levels of efficiency, innovation, and customer satisfaction in these sectors, ultimately revolutionizing the way businesses operate and interact with consumers.


Enigma – Mission X Challenge Accomplished with Python

Enigma M3 from 101 computing: https://www.101computing.net/enigma/
GitHub Repo: https://github.com/cuicaihao/Enigma-Mission-X

Short Summary

Inspired by the Enigma – Mission X Challenge, this repo collects research and practice on different cipher methods.

The primary goal is to use the Python programming language, in Jupyter notebooks, to achieve the targets listed below:

Example

  • German Navy Ciphertext by Enigma M3: OJSBI BUPKA ECMEE ZH
  • German Message: Ziel hafen von DOVER
  • English Translation: Target port of DOVER
Enigma Mission – X

By running the notebook with the “keys,” it is not difficult to decipher the German Navy ciphertext and recover the original message.
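To give a flavor of the machinery the notebook implements, below is a toy, single-rotor sketch (not the full M3: no plugboard, ring settings, or multi-rotor stepping) using the commonly published rotor I and reflector B wirings. Because a reflector sits in the signal path, running the machine again from the same start position decrypts what it just encrypted, which is why knowing the “keys” makes deciphering easy.

# Toy single-rotor machine in the spirit of Enigma (NOT the full M3).
# With a reflector in the path, the same call both encrypts and decrypts.
import string

ALPHA = string.ascii_uppercase
ROTOR = "EKMFLGDQVZNTOWYHXUSPAIBRCJ"        # commonly published rotor I wiring
REFLECTOR = "YRUHQSLDPXNGOKMIEBFZCWVJAT"    # commonly published reflector B wiring

def crypt(text: str, start: int) -> str:
    """Encrypt or decrypt: with a reflector, the same call does both."""
    pos, out = start, []
    for ch in text.upper():
        if ch not in ALPHA:
            continue
        pos = (pos + 1) % 26                            # rotor steps before each letter
        c = ALPHA[(ALPHA.index(ch) + pos) % 26]         # enter the rotor at its offset
        c = ROTOR[ALPHA.index(c)]                       # forward through the wiring
        c = ALPHA[(ALPHA.index(c) - pos) % 26]
        c = REFLECTOR[ALPHA.index(c)]                   # bounce off the reflector
        c = ALPHA[(ALPHA.index(c) + pos) % 26]
        c = ALPHA[ROTOR.index(c)]                       # back through the inverse wiring
        c = ALPHA[(ALPHA.index(c) - pos) % 26]
        out.append(c)
    return "".join(out)

message = "ZIEL HAFEN VON DOVER"
secret = crypt(message, start=5)                        # 'start' plays the role of the key
assert crypt(secret, start=5) == message.replace(" ", "")
print(secret)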

Notebook Outputs Example

However, breaking the cipher without knowing the keys is much harder. That is the Turing-Welchman Bombe Simulator challenge.

About Enigma Mission X

Mission X is a game for programmers to accomplish the deciphering job required by Dr Alan Turing.

Mission X Letter from Alan Turing

Programmers need to break the secret with limited information as follows.

Example Message from German Navy

END

Technical Review 04: Human-Computer Interface from In-Context Learning to Instruct Understanding

  1. AI Assistant Summary
  2. Interface with LLM
  3. The Mysterious In-Context Learning
  4. Magical Instruct understanding
    1. Type 1: Academic Research Oriented Instruct
    2. Type 2: Human/Customer Needs Oriented Instruct
  5. In Context Learning & Instruct Connection
  6. What’s Next?

AI Assistant Summary

The post first discusses different interface technologies used to connect people with language models. These include zero-shot prompting, few-shot prompting, in-context learning, and instruction. It explains the differences between zero-shot and few-shot learning and their advantages and limitations.

Next, it explores the concept of in-context learning, where language models can predict new examples by looking at existing ones without changing their parameters. It compares in-context learning with fine-tuning and highlights the differences between the two approaches.

The post then focuses on instruct understanding, dividing it into two categories: research-oriented and human/customer-needs-oriented instruction. It emphasizes the importance of considering actual user needs in instruct-based tasks.

Lastly, it suggests a possible connection between in-context learning and instruction, proposing that language models could generate task descriptions based on real task instances. It mentions a study that shows improved performance when using instruction derived from this method.

Interface with LLM

Generally, the interface technologies between people and LLM that we often mention include zero-shot prompting, few-shot prompting, In-Context Learning, and instruction. These are actually ways of describing a specific task. But if you look at the literature, you will find that the names are quite confusing.

Zero-shot learning simply feeds the task text to the model and asks for results.

Text: i'll bet the video game is a lot more fun than the film.
Sentiment:

Few-shot learning presents a set of high-quality demonstrations, each consisting of both input and desired output, on the target task. As the model first sees good examples, it can better understand human intention and criteria for what kinds of answers are wanted. Therefore, few-shot learning often leads to better performance than zero-shot. However, it comes at the cost of more token consumption and may hit the context length limit when the input and output text are long.

Text: (lawrence bounces) all over the stage, dancing, running, sweating, mopping his face and generally displaying the wacky talent that brought him fame in the first place.
Sentiment: positive

Text: despite all evidence to the contrary, this clunker has somehow managed to pose as an actual feature movie, the kind that charges full admission and gets hyped on tv and purports to amuse small children and ostensible adults.
Sentiment: negative

Text: for the first time in years, de niro digs deep emotionally, perhaps because he's been stirred by the powerful work of his co-stars.
Sentiment: positive

Text: i'll bet the video game is a lot more fun than the film.
Sentiment:

Among them, Instruct is the interface style used by ChatGPT: people describe the task in natural language, for example

Translate this sentence from Chinese to English:
....

“Zero-shot prompting” was the original name for what is now commonly called “Instruct.” The two terms refer to the same thing, but the methods behind them are different (more on this below).

When interacting with instruction models, we should describe the task requirements in detail, being specific and precise; avoid saying what not to do and instead specify what to do.

Please label the sentiment towards the movie of the given movie review. The sentiment label should be "positive" or "negative". 
Text: i'll bet the video game is a lot more fun than the film.
Sentiment:

Describing the intended audience is another smart way to give instructions, for example to produce educational materials for kids, or content that is safe for work:

Describe what is quantum physics to a 6-year-old.

.. in language that is safe for work.

In-context instruction learning (Ye et al. 2023) combines few-shot learning with instruction prompting. It incorporates multiple demonstration examples across different tasks in the prompt, each demonstration consisting of instruction, task input, and output. Note that their experiments were only on classification tasks and the instruction prompt contains all label options.

Definition: Determine the speaker of the dialogue, "agent" or "customer".
Input: I have successfully booked your tickets.
Output: agent

Definition: Determine which category the question asks for, "Quantity" or "Location".
Input: What's the oldest building in US?
Output: Location

Definition: Classify the sentiment of the given movie review, "positive" or "negative".
Input: i'll bet the video game is a lot more fun than the film.
Output:
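As a small illustration, a prompt like the one above can be assembled programmatically from (instruction, input, output) triples; the exact template wording below is an assumption, not the paper’s.

# Assembling an in-context-instruction prompt from (instruction, input, output)
# demonstrations plus the query task. The template wording is an assumption.
def build_prompt(demos, query_instruction, query_input):
    blocks = [
        f"Definition: {inst}\nInput: {inp}\nOutput: {out}"
        for inst, inp, out in demos
    ]
    blocks.append(f"Definition: {query_instruction}\nInput: {query_input}\nOutput:")
    return "\n\n".join(blocks)

demos = [
    ('Determine the speaker of the dialogue, "agent" or "customer".',
     "I have successfully booked your tickets.", "agent"),
    ('Determine which category the question asks for, "Quantity" or "Location".',
     "What's the oldest building in US?", "Location"),
]
print(build_prompt(
    demos,
    'Classify the sentiment of the given movie review, "positive" or "negative".',
    "i'll bet the video game is a lot more fun than the film."))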

In the early days, people would try to express a task with different words or sentences, continually refining the phrasing until it worked; this was essentially a matter of fitting the distribution of the training data. The current approach is to give an explicit command statement and expect the language model to understand it. Both methods express tasks, but the ideas behind them are distinct.

In Context Learning and few-shot prompting have a similar meaning. They both involve providing examples to a language model and using them to solve new problems.

In my opinion, In Context Learning can be seen as a specific task, while Instruct is a more abstract method of describing tasks. However, the usage of these terms can be confusing, and this understanding is just my personal opinion. Therefore, I will only discuss In Context Learning and Instruct here, and no longer mention zero-shot and few-shot anymore.

The Mysterious In-Context Learning

If you think about it carefully, you will find that In Context Learning is a very magical technology. What’s so magical about it?

The magic is that when you provide the LLM with several sample examples {<x1, y1>, <x2, y2>, …, <xn, yn>} and then give it x_n+1, the LLM can successfully predict the corresponding y_n+1.

When you hear this, you might ask: What’s so magical about this? Isn’t that how fine-tuning works? If you ask this, it means you haven’t thought deeply enough about this issue.

Fine-tuning and In Context Learning both seem to provide some examples to the LLM, but they are qualitatively different (refer to the figure above): fine-tuning uses the examples as training data and applies backpropagation to modify the LLM’s parameters, and that act of updating the parameters is precisely what “learning from the examples” means.

In Context Learning, by contrast, merely shows the examples to the LLM and then asks it to predict new ones; backpropagation is never used to modify the model’s parameters. Since the parameters are untouched, the LLM has apparently not gone through any learning process. If it has not learned, why can it predict new examples correctly just by looking at a few?
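The difference is easy to see in code. In the sketch below (which assumes a small local checkpoint such as gpt2, purely for illustration), the first half turns the examples into training data and updates the weights with backpropagation, while the second half only places the same examples in the prompt and never touches a single parameter.

# Fine-tuning vs. in-context learning, assuming a small checkpoint like "gpt2".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# --- Fine-tuning: the examples become training data and gradients update the weights.
batch = tok("Text: great movie. Sentiment: positive", return_tensors="pt")
out = model(**batch, labels=batch["input_ids"])
out.loss.backward()                                     # backpropagation
torch.optim.SGD(model.parameters(), lr=1e-5).step()     # parameters change

# --- In-context learning: the same examples only appear in the prompt; no gradients.
prompt = ("Text: great movie. Sentiment: positive\n"
          "Text: dull and lifeless. Sentiment: negative\n"
          "Text: i'll bet the video game is a lot more fun than the film. Sentiment:")
with torch.no_grad():
    ids = tok(prompt, return_tensors="pt").input_ids
    gen = model.generate(ids, max_new_tokens=2)
print(tok.decode(gen[0][ids.shape[1]:]))                # prediction without any update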

This is the magic of In Context Learning. Does it remind you of the lyric: “Just because I took one more look at you in the crowd, I can never forget your face again”? The song is called “Legend.” Legendary indeed, isn’t it?

On the surface, In Context Learning does not learn knowledge from the examples. So does the LLM learn in some strange way, or does it truly learn nothing at all? The answer is still an unsolved mystery. Existing studies tell different stories, some of them outright contradictory, and it is hard to judge which one is true.

Here are some current opinions. As for who is right and who is wrong, you can only decide for yourself. Of course, I think pursuing the truth behind this magical phenomenon is a good research topic.

Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? tries to show that In Context Learning is not learning the mapping between x and y.

The authors found that in the demonstrations {<xi, yi>} provided to the LLM, it does not actually matter whether yi is the correct answer for xi: replacing the correct answers with random ones barely affects In Context Learning performance.

What has a much greater impact on In Context Learning is the distribution of x and y, that is, the distribution of the input text x and of the candidate answers y. If you change either of these distributions, for example by replacing y with strings outside the candidate answer set, the In Context Learning effect drops sharply.

In short, this work argues that In Context Learning does not learn the input-to-output mapping function; what matters is the distribution of the inputs and outputs, and these two cannot be changed arbitrarily.
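A minimal sketch of that probe: keep the inputs and the label space, but randomly reassign the demonstration labels, then score both prompts with the same evaluation harness (the model call and scoring are omitted here).

# Sketch of the label-randomisation probe: keep the inputs and the label space,
# but randomly reassign the demonstration labels, then compare task accuracy of
# the two prompts with the same evaluation harness (not shown here).
import random

demos = [
    ("all over the stage, dancing, running, sweating ...", "positive"),
    ("this clunker has somehow managed to pose as an actual feature movie ...", "negative"),
    ("de niro digs deep emotionally ...", "positive"),
]
label_space = ["positive", "negative"]

def to_prompt(pairs, query):
    lines = [f"Text: {x}\nSentiment: {y}" for x, y in pairs]
    lines.append(f"Text: {query}\nSentiment:")
    return "\n\n".join(lines)

random.seed(0)
random_demos = [(x, random.choice(label_space)) for x, _ in demos]
query = "i'll bet the video game is a lot more fun than the film."
print(to_prompt(demos, query))         # gold labels
print(to_prompt(random_demos, query))  # random labels: per the paper, scores barely move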

Magical Instruct understanding

We can regard “Instruct” as a task description that is convenient for human beings to understand. Under this premise, current research on “Instruct” can be divided into two types: “Instruct” oriented toward academic research, and “Instruct” that describes real human needs.

Fine-tuned Language Models Are Zero-Shot Learners (FLAN)

Type 1: Academic Research Oriented Instruct

Let’s look at the first type “Instruct” which is more academically research-oriented. Its core research theme is the generalization ability of the LLM model to understand “Instruct” in multi-task scenarios.

As the FLAN model in the figure above shows, the setup involves many NLP tasks. For each task, the researchers construct one or more prompt templates to serve as the task’s Instruct and then fine-tune the LLM on the training examples so that it learns many tasks at once.

After training, the LLM is given the instruction for a brand-new task it has never seen and asked to solve it zero-shot. From how well it solves the task, we can judge whether the LLM has the ability to generalize its understanding of Instructs.

Research findings point to several factors that significantly enhance an LLM’s ability to generalize across instructions: increasing the number of tasks in the multi-task mixture, scaling up the model, adding CoT prompting, and diversifying the range of tasks. Incorporating these measures substantially improves the model’s capacity to understand instructions.

Type 2: Human/Customer Needs Oriented Instruct

The second type is instruction based on real human needs. This type of research is represented by InstructGPT and ChatGPT. This type of work is also based on multi-tasking, but the biggest difference from academic research-oriented work is that it is oriented to the real needs of human users.

Why do you say that? Because the task description prompts they use for LLM multi-task training are sampled from real requests submitted by a large number of users, instead of fixing the scope of the research task and then letting researchers write the task description prompts.

The “real needs” here show up in two ways: first, because the prompts are randomly sampled from task descriptions submitted by users, the task types covered are more diverse and closer to what users actually need; second, each prompt is written by a user, so it reflects how ordinary people actually phrase their requirements, not how researchers imagine they would. Unsurprisingly, an LLM improved by this type of work delivers a better user experience.

The InstructGPT paper also compares this method with FLAN’s Instruct-based approach: the FLAN tasks, data, and prompt templates are used to fine-tune GPT-3, reproducing the FLAN method on GPT-3, which is then compared against InstructGPT. Since InstructGPT’s base model is also GPT-3, only the data and method differ, so the two are directly comparable; the result is that the FLAN approach lags far behind InstructGPT.

So what is the reason behind this? After analyzing the data, the paper concludes that the FLAN method covers relatively few task domains, a subset of those covered by InstructGPT, which is why it underperforms. In other words, the tasks in the FLAN paper are misaligned with actual user needs, leading to weak results in real scenarios. The lesson is that collecting real needs from user data matters a great deal.

In Context Learning & Instruct Connection

If In Context Learning uses a few examples to express a task concretely, then Instruct is an abstract task description that better matches human habits.

So a natural question arises: is there any connection between them? For example, can we give the LLM several concrete examples of a task and have it produce the corresponding Instruct command described in natural language? (In other words, can the LLM write the instruction for itself by watching how humans perform the task?)

There’s actually some work being done on this issue here and there, and I think it’s a really interesting research direction.

Let’s talk about the answer first. The answer is: Yes, LLM can.

Large Language Models are Human-Level Prompt Engineers is a very interesting piece of work in this direction.

As shown in the figure above, for a given task the LLM is shown some examples and asked to automatically generate a natural language command that describes the task; the generated task description is then used to test how well it drives task performance.

The base models it uses are GPT-3 and InstructGPT. With this technique, the Instruct generated by the LLM performs much better than GPT-3 and InstructGPT without it, and on some tasks it even reaches superhuman performance.
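The loop can be sketched roughly as follows; `llm` here is a trivial stand-in rather than a real model or API, and the prompt wording and antonym example are only meant to illustrate the induce-then-score idea:

```python
# A rough sketch of the induce-then-score loop described above. `llm` is a
# trivial stand-in rather than a real model or API; prompt wording and the
# antonym example are only meant to illustrate the idea.
def llm(prompt: str) -> str:
    # Placeholder: a real implementation would query an actual LLM here.
    if "I gave a friend an instruction" in prompt:
        return "Write the antonym of the given word."
    return "small" if "big" in prompt else "cold"

demos = [("big", "small"), ("hot", "cold")]
heldout = [("hot", "cold")]

# Step 1: ask the model to induce an instruction from input-output examples.
induction_prompt = (
    "I gave a friend an instruction. Based on it they produced these pairs:\n"
    + "\n".join(f"Input: {x}\nOutput: {y}" for x, y in demos)
    + "\nThe instruction was:"
)
candidates = [llm(induction_prompt)]  # the real method samples many candidates

# Step 2: score each candidate by execution accuracy on held-out examples.
def score(instruction: str) -> float:
    hits = sum(llm(f"{instruction}\nInput: {x}\nOutput:").strip() == y
               for x, y in heldout)
    return hits / len(heldout)

best = max(candidates, key=score)
print("Induced instruction:", best, "| held-out accuracy:", score(best))
```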

This suggests a mysterious inner connection between concrete task examples and natural language descriptions of tasks. As for what exactly that connection is, we have no solid conclusions yet.

What’s Next?

Technical Review 05: How to Enhance LLM’s Reasoning Ability

Previous Blogs:

Technical Review 03: Scale Effects & What happens when LLMs get bigger and bigger

  1. AI Assistant Summary
  2. Introduction
  3. Part One: pre-training phase
    1. Open AI
    2. Deep Mind
  4. Part Two: downstream tasks
    1. Linearity Tasks
    2. Breakthrough Tasks
    3. U-shaped Tasks
  5. Personal View
  6. Reference
  7. What’s Next?

AI Assistant Summary

This blog discusses the scale of Large Language Models (LLMs) and their impact on performance. LLMs like GPT, LaMDA, and PaLM have billions of parameters, raising questions about the consequences of their continued growth.

The journey of an LLM involves two stages: pre-training and scenario application. Pre-training focuses on optimizing the model using cross-entropy, while scenario application evaluates the model’s performance in specific use cases. Evaluating an LLM’s quality requires considering both stages, rather than relying solely on pre-training indicators.

Increasing training data, model parameters, and training time has been found to enhance performance in the pre-training stage. OpenAI and DeepMind have explored this issue, with OpenAI finding that a combination of more data and parameters, along with fewer training steps, produces the best results. DeepMind considers the amount of training data and model parameters equally important.

The influence of model size on downstream tasks varies. Linear tasks show consistent improvement as the model scales, while breakthrough tasks only benefit from larger models once they reach a critical scale. Tasks involving logical reasoning demonstrate sudden improvement at specific model scales. Some tasks exhibit U-shaped growth, where performance initially declines but then improves with larger models.

Reducing the LLM’s parameters while increasing training data proportionally can decrease the model’s size without sacrificing performance, leading to faster inference speed.

Understanding the impact of model size on both pre-training and downstream tasks is vital for optimizing LLM performance and exploring the potential of these language models.

Introduction

In recent years, we’ve witnessed a surge in the size of Large Language Models (LLMs), with models now boasting over 100 billion parameters becoming the new standard. Think OpenAI’s GPT-3 (175B), Google’s LaMDA (137B), PaLM (540B), and other global heavyweights. China, too, contributes to this landscape with models like Zhiyuan GLM, Huawei’s “Pangu,” Baidu’s “Wenxin,” etc. But here’s the big question: What unfolds as these LLMs continue to grow?

The journey of pre-trained models involves two crucial stages: pre-training and scenario application.

In the pre-training stage, the optimization objective is the cross-entropy loss; for autoregressive language models such as GPT, this amounts to checking whether the LLM correctly predicts the next word.
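In its standard form (a textbook formulation, not tied to any single paper), this objective is the average negative log-likelihood of the next token given its prefix:

$$
\mathcal{L}_{\text{pretrain}}(\theta) = -\frac{1}{T}\sum_{t=1}^{T}\log P_\theta\left(x_t \mid x_{<t}\right)
$$

Lower loss on held-out text simply means the model assigns higher probability to the words that actually come next.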

The real test, however, comes in the scenario application stage, where specific use cases dictate the evaluation criteria. Intuitively, one would expect that an LLM with better pre-training indicators will naturally be stronger at downstream tasks, but this is not entirely true.

Existing research shows that the pre-training optimization metric is positively correlated with downstream task performance, but the correlation is not perfect. In other words, pre-training indicators alone are not enough to judge whether an LLM is good enough. With that in mind, let’s examine the two stages separately to see what happens as the LLM grows.

Part One: pre-training phase

First, let’s look at what happens as the model size gradually increases during the pre-training stage. OpenAI specifically studied this issue in “Scaling Laws for Neural Language Models” and proposed the “scaling law” followed by the LLM model.

Source: Scaling Laws for Neural Language Models

As shown in the figure above, this study proves that when we independently increase (1) the amount of training data, (2) model parameter size and (3) extend the model training time (such as from 1 Epoch to 2 Epochs), the Loss of the pre-trained model on the test set will decrease monotonically. In other words, the model’s effectiveness is improving steadily.

Since all three factors are important when we actually do pre-training, we have a decision-making problem on how to allocate computing power:

Question: given a fixed total compute budget for training the LLM (e.g., a fixed number of GPU hours or GPU days), how should we allocate it?

Should we increase the amount of data and reduce model parameters?

Or should we increase the amount of data and model size at the same time but reduce the number of training steps?

Open AI

This is a zero-sum game: if one factor is scaled up, the others must be scaled down to keep total compute unchanged, so there are many possible allocation plans.

In the end, OpenAI chose to increase training data and model parameters at the same time, while using an early-stopping strategy to reduce the number of training steps. The paper shows that increasing only one of the two (data volume or parameters) is not the best choice; it is better to increase both according to a certain proportion, giving priority to model parameters and then to training data.

Concretely, if the total compute budget for training the LLM increases by 10x, the model parameters should increase by about 5.5x and the training data by about 1.8x to get the best performance.
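As a rough back-of-the-envelope sketch, the exponents below are simply derived from those 5.5x/1.8x figures rather than read out of the paper itself:

```python
import math

# A back-of-the-envelope sketch of the allocation quoted above (10x compute ->
# ~5.5x parameters, ~1.8x data). The exponents are derived from those two
# numbers themselves, not taken directly from the paper.
ALPHA_N = math.log10(5.5)   # ~0.74: exponent governing parameter growth
ALPHA_D = math.log10(1.8)   # ~0.26: exponent governing data growth

def allocate(compute_multiplier: float) -> tuple[float, float]:
    """Return (parameter multiplier, data multiplier) for a compute multiplier."""
    return compute_multiplier ** ALPHA_N, compute_multiplier ** ALPHA_D

for k in (10, 100):
    n_mult, d_mult = allocate(k)
    print(f"{k}x compute -> {n_mult:.1f}x parameters, {d_mult:.1f}x data")
# e.g. 10x compute -> ~5.5x parameters and ~1.8x data;
#      100x compute -> ~30x parameters and ~3.2x data.
```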

Deep Mind

A study by DeepMind (Reference: Training Compute-Optimal Large Language Models) explored this issue in more depth.

Source: Training Compute-Optimal Large Language Models

Its basic conclusions are similar to those of OpenAI. For example, it is indeed necessary to increase the amount of training data and model parameters at the same time, so that the model effect will be better.

Many large models did not take this into account during pre-training: they monotonically increased the model parameters while keeping the amount of training data fixed. This approach is wrong and limits the potential of the LLM.

However, DeepMind revises OpenAI’s proportional relationship between the two, arguing that training data volume and model parameters are equally important.

In other words, assuming that the total computing power budget used to train LLM increases by 10 times, the number of model parameters should be increased by 3.3 times, and the amount of training data should also be increased by 3.3 times to get the best model.

This means that increasing the amount of training data is more important than we previously thought. Based on this understanding, DeepMind chose a different compute allocation when designing the Chinchilla model: compared with the Gopher model (280B parameters trained on 300B tokens of data), Chinchilla uses 4x the training data but only one-quarter of Gopher’s parameters, about 70B. Yet on both pre-training indicators and many downstream task indicators, Chinchilla beats the larger Gopher.
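A quick sanity check, using the common approximation that training compute is roughly 6 × parameters × tokens, shows why this swap stays within the same budget (the numbers are the ones quoted above):

```python
# A quick sanity check (illustrative only) that the Gopher -> Chinchilla swap
# described above keeps the training budget roughly constant, using the common
# approximation: training compute C ~= 6 * N * D (N = parameters, D = tokens).
def train_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens

gopher     = train_flops(280e9, 300e9)      # 280B parameters, 300B tokens
chinchilla = train_flops(70e9, 4 * 300e9)   # 1/4 the parameters, 4x the tokens

print(f"Gopher:     {gopher:.2e} training FLOPs")
print(f"Chinchilla: {chinchilla:.2e} training FLOPs")  # same budget, smaller model
```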

This brings us to the following enlightenment:

We can scale up the training data and scale down the model parameters in the same proportion, greatly shrinking the model without sacrificing performance.

A smaller model has many benefits, such as much faster inference at application time. This is undoubtedly a promising development route for LLM.

Part Two: downstream tasks

That covers the impact of model scale in the pre-training stage. Turning to the effect of LLM on specific downstream tasks, different types of tasks behave differently as the model scale increases.

Source: Beyond the Imitation Game Benchmark

Specifically, there are the following three types of tasks.

  • (a) Tasks that achieve the highest linearity scores see model performance improve predictably with scale and typically rely on knowledge and simple textual manipulations.
  • (b) Tasks with high breakthroughs do not see model performance improve until the model reaches a critical scale. These tasks generally require sequential steps or logical reasoning. Around 5% of BIG-bench tasks see models achieve sudden score breakthroughs with increasing scale.
  • (c) Tasks that achieve the lowest (negative) linearity scores see model performance degrade with scale.

Linearity Tasks

The first type of task perfectly reflects the scaling law of the LLM model, which means that as the model scale gradually increases, the performance of the tasks gets better and better, as shown in (a) above.

Such tasks usually have the following common characteristics: they are often knowledge-intensive tasks. That is to say, if the LLM model contains more knowledge, the performance of such tasks will be better.

Many studies have proven that the larger the LLM model, the higher the learning efficiency. For the same amount of training data, the larger the model, the better the performance. This shows that even when faced with the same batch of training data, a larger LLM model is relatively more efficient in getting more knowledge than small ones.

What’s more, in practice, when the LLM’s parameters are increased, the amount of training data usually increases at the same time, which means larger models can learn more knowledge points from more data. Together, these studies explain the figure above: as the model grows, knowledge-intensive tasks keep getting better.

Most traditional NLP tasks are actually knowledge-intensive tasks, and many tasks have achieved great improvement in the past few years, even surpassing human performance. Obviously, this is most likely caused by the increase in the scale of the LLM model, rather than due to a specific technical improvement.

Breakthrough Tasks

The second type of task shows that LLM has some kind of “emergent ability,” as shown in (b) above. “Emergent ability” means that below a certain parameter-scale threshold, the model essentially cannot solve such tasks at all, performing no better than randomly selecting answers; but once the model scale crosses that threshold, performance on these tasks jumps suddenly.

In other words, model size is the key to unlocking new capabilities of LLM. As the model grows larger and larger, more and more new capabilities are gradually unlocked.

This is a remarkable phenomenon, because it implies an optimistic possibility: many tasks that LLM cannot solve well today may be solvable in the future simply by making the model larger, since its “emergent capabilities” may suddenly unlock those limits one day. The growth of LLM may keep bringing us unexpected and wonderful gifts.

The article “Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models” points out that tasks exhibiting “emergent capabilities” share some common features: they generally consist of multiple steps, solving them requires first solving multiple intermediate steps, and logical reasoning plays an important role in the final solution.

Chain of Thought (CoT) Prompting is a typical technology that enhances the reasoning ability of LLM, which can greatly improve the effect of such tasks. I will discuss the CoT technology in the following blogs.

Here the most important question is, why does LLM have this “emergent ability” phenomenon? The article “Emergent Abilities of Large Language Models” shares several possible explanations:

Source: Emergent Abilities of Large Language Models

One possible explanation is that the evaluation metrics of some tasks are not smooth enough. For example, some generation-task metrics require the model’s output string to match the reference answer exactly to count as correct; otherwise it scores zero.

Thus, even as the model gradually improves and outputs more correct fragments, any small error still earns zero points; the task only scores once the model is large enough to get the entire output right. In other words, because the metric is not smooth, it hides the fact that the LLM is actually improving gradually on the task, and this shows up externally as “emergent ability.”

Another possible explanation is that some tasks consist of several intermediate steps. As model size increases, the ability to solve each individual step improves gradually, but if any one intermediate step is wrong, the final answer is wrong. This, too, produces a superficial “emergent ability” phenomenon.
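A toy calculation makes the first explanation tangible (and, reading “steps” for “tokens,” the second as well); the numbers are made up:

```python
# A toy calculation (made-up numbers) for the "non-smooth metric" explanation:
# if a correct answer needs L tokens -- or L intermediate steps -- to all be
# right, and per-step accuracy improves smoothly with scale, the exact-match
# score p**L still looks like a sudden jump.
L = 20  # tokens / intermediate steps that must all be correct

for p in (0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99):  # smooth improvement
    print(f"per-step accuracy {p:.2f} -> exact-match {p ** L:.4f}")

# Per-step accuracy climbs gradually, but exact match stays near zero until p
# approaches 1 -- which reads as an "emergent" jump when plotted against scale.
```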

Of course, the above explanations are still conjectures at present. As for why LLM has this phenomenon, further and in-depth research is needed.

U-shaped Tasks

Source: Inverse scaling can become U-shaped

A small number of tasks show a U-shaped curve as model size increases: performance first gets worse as the model grows, then starts improving again once the model grows further. The figure above shows this U-shaped trend in the metrics of the pink PaLM model on two such tasks.

Why do these tasks appear so special? The article “Inverse Scaling Can Become U-shaped” gives an explanation:

These tasks actually contain two different types of subtasks: the real task and a “distractor task.”

  • When the model size is small, it cannot identify any sub-task, so the performance of the model is similar to randomly selecting answers.
  • When the model grows to a medium size, it mainly tries to solve the interference task, so it has a negative impact on the real task performance. This is reflected in the decline of the real task effect.
  • When the model size is further increased, LLM can ignore the interfering task and perform the real task, which is reflected in the effect starting to grow.

For tasks whose performance had been declining with model size, adding Chain of Thought (CoT) Prompting converts some of them to follow the scaling law (the larger the model, the better the performance) and converts others to a U-shaped growth curve.

This actually shows that this type of task should be a reasoning-type task, so the task performance will change qualitatively after adding CoT.

Personal View

Increasing the size of the LLM model may not seem technically significant, but it is actually very important to build better LLMs. In my opinion, the advancements from Bert to GPT 3 and ChatGPT are likely attributed to the growth of the LLM model size rather than a specific technology. I believe a lot of people want to explore the scale ceiling of the LLM model if possible.

The key to achieving AGI may lie in having large and diverse data, large-scale models, and rigorous training processes. Developing such large LLM models requires high engineering skills from the technical team, which means there is technical content involved.

Increasing the scale of the LLM model has research significance. There are two main reasons why it is valuable.

  • Firstly, as the model size grows, the performance of various tasks improves, especially for knowledge-intensive tasks. Additionally, for reasoning and difficult tasks, the effect of adding CoT Prompting follows a scaling law. Therefore, it is important to determine to what extent the scale effect of LLM can solve these tasks.
  • Secondly, the “emergent ability” of LLM suggests that increasing the model size may unlock new capabilities that we did not expect. This raises the question of what these capabilities could be.

Considering these factors, it is necessary to continue increasing the model size to explore the limits of its ability to solve different tasks.

Talk is cheap; in reality, very few AI/ML practitioners have the opportunity or ability to build larger models, given the financial requirements, investment willingness, engineering capability, and technical enthusiasm this demands from research institutions. There are probably no more than 10 institutions on Earth that can do it. In the future, however, capable institutions may join forces to build a super-large model:

All (Resources) for One (Model) and One (Model) for All (People).

Modified from Alexandre Dumas, The Three Musketeers

Reference

  1. OpenAI 2020: Scaling Laws for Neural Language Models (https://arxiv.org/abs/2001.08361)
  2. DeepMind 2022: Training Compute-Optimal Large Language Models (https://arxiv.org/abs/2203.15556)
  3. BIG-bench Project Team 2023: Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models (https://arxiv.org/abs/2206.04615)
  4. Google 2023: Inverse scaling can become U-shaped (https://arxiv.org/abs/2211.02011)

What’s Next?

Technical Review 04: Human-Computer Interface: From In Context Learning to Instruct Understanding (ChatGPT)

Previous Blogs

Technical Review 01: Large Language Model (LLM) and NLP Research Paradigm Transformation

TL;DR: This blog explores the profound influence of ChatGPT, awakening various sectors – the general public, academia, and industry – to the developmental philosophy of Large Language Models (LLMs). It delves into OpenAI’s prominent role and analyzes the transformative effect of LLMs on Natural Language Processing (NLP) research paradigms. Additionally, it contemplates future prospects for the ideal LLM.

  1. AI Assistant Summary
  2. Introduction
  3. NLP Research Paradigm Transformation
    1. Paradigm Shift 1.0 (2013): From Deep Learning to Two-Stage Pre-trained Models
      1. Impact 1: The Decline of Intermediate Tasks
      2. Impact 2: Standardization of Technical Approaches Across All Areas
    2. Paradigm Shift 2.0 (2020): Moving from Pre-Trained Models to General Artificial Intelligence (AGI)
  4. The Ideal Large Language Model (LLM)
    1. Impact 1: LLM Adapting to Human Needs with Natural Interfaces
    2. Impact 2: Many NLP subfields no longer have independent research value
    3. Impact 3: More research fields other than NLP will be included in the LLM technology system
  5. Reference

AI Assistant Summary

This blog discusses the impact of ChatGPT and the awakening it brought to the understanding of Large Language Models (LLM). It emphasizes the importance of the development philosophy behind LLM and notes OpenAI’s leading position, followed by Google, with DeepMind and Meta catching up. The article highlights OpenAI’s contributions to LLM technology and the global hierarchy in this domain.

What is Gen-AI’s superpower?

The blog is divided into two main sections: the NLP research paradigm transformation and the ideal Large Language Model (LLM).

In the NLP research paradigm transformation section, there are two significant paradigm shifts discussed. The first shift, from deep learning to two-stage pre-trained models, marked the introduction of models like Bert and GPT. This shift led to the decline of intermediate tasks in NLP and the standardization of technical approaches across different NLP subfields.

The second paradigm shift focuses on the move from pre-trained models to General Artificial Intelligence (AGI). The blog highlights the impact of ChatGPT in bridging the gap between humans and LLMs, allowing LLMs to adapt to human commands and preferences. It also suggests that many independently existing NLP research fields will be incorporated into the LLM technology system, while other fields outside of NLP will also be included. The ultimate goal is to achieve an ideal LLM that is a domain-independent general artificial intelligence model.

In the section on the ideal Large Language Model (LLM), the blog discusses the characteristics and capabilities of an ideal LLM. It emphasizes the self-learning capabilities of LLMs, the ability to tackle problems across different subfields, and the importance of adapting LLMs to user-friendly interfaces. It also mentions the impact of ChatGPT in integrating human preferences into LLMs and the future potential for LLMs to expand into other fields such as image processing and multimodal tasks.

Overall, the blog provides insights into the impact of ChatGPT, the hierarchy in LLM development, and the future directions for LLM technology.

Introduction

Since the emergence of OpenAI’s ChatGPT, many people and companies across academia and industry have been both surprised and awakened. I was pleasantly surprised because I did not expect a Large Language Model (LLM) to be this effective, and I was shocked because most of our academic and industrial understanding of LLMs and their development philosophy is far from the world’s most advanced thinking. This blog series collects my reviews, reflections, and thoughts about LLM.

Since GPT 3.0, LLM has not been merely a specific technology; it embodies a development philosophy about where LLMs should be heading. From a technical standpoint, I believe the main gap lies in differing understandings of LLM and its development philosophy for the future, regardless of the financial resources available to build one.

While many AI-related companies are currently at a “critical stage of survival,” I don’t believe the situation is as dire as it may seem. In my view, OpenAI is the only organization in the world with a truly forward-thinking vision. ChatGPT has demonstrated exceptional performance that has left everyone trailing behind; even giants like Google lag in their understanding of the LLM development concept and in the versions of their products.

In the field of LLMs, there is a clear hierarchy. OpenAI leads internationally, roughly six months to a year ahead of Google and DeepMind, and approximately two years ahead of China. Google holds the second position, with technologies like PaLM 1/2, Pathways, and Generative AI on GCP Vertex AI aligning with its technical vision. These were launched between February and April of 2022, around the same time as OpenAI’s InstructGPT 3.5. This highlights the difference between Google and OpenAI.

DeepMind has mainly focused on reinforcement learning for games and AI for science. It started paying serious attention to LLM in 2021 and is currently catching up. Meta AI, previously known as Facebook AI, had not prioritized LLM in the past, but is now trying to catch up with the recently open-sourced Llama 2. These institutions are currently among the best in the field.

To summarize mainstream LLM technology, I mainly focus on the Transformer, BERT, GPT, and ChatGPT (up to 4.0).

NLP Research Paradigm Transformation

Taking a look back at the early days of deep learning in Natural Language Processing (NLP), we can see significant milestones over the past two decades. There have been two major shifts in the technology of NLP.

Paradigm Shift 1.0 (2013): From Deep Learning to Two-Stage Pre-trained Models

The period of this paradigm shift encompasses roughly the time frame from the introduction of deep learning into the field of NLP, around 2013, up until just before the emergence of GPT 3.0, which occurred around May 2020.

Prior to the rise of models like BERT and GPT, the prevailing technology in the NLP field was deep learning. It was primarily reliant on two core technologies:

  1. A plethora of enhanced LSTM models and a smaller number of improved ConvNet models served as typical Feature Extractors.
  2. A prevalent technical framework for various specific tasks was based on Sequence-to-Sequence (or Encoder-Decoder) Architectures coupled with Attention mechanisms.

With these foundational technologies in place, the primary research focus in deep learning for NLP revolved around how to effectively increase model depth and parameter capacity. This involved the continual addition of deeper LSTM or CNN layers to encoders and decoders with the aim of enhancing layer depth and model capacity. Despite these efforts successfully deepening the models, their overall effectiveness in solving specific tasks was somewhat limited. In other words, the advantages gained compared to non-deep learning methods were not particularly significant.

The difficulties that have held back the success of deep learning in NLP can be attributed to two main issues:

  1. Scarcity of Training Data: One significant challenge is the lack of enough training data for specific tasks. As the model becomes more complex, it requires more data to work effectively. This used to be a major problem in NLP research before the introduction of pre-trained models.
  2. Limited Ability of LSTM/CNN Feature Extractors: Another issue is that the feature extractors using LSTM/CNN are not versatile enough. This means that, no matter how much data you have, the model struggles to make good use of it because it can’t effectively capture and utilize the information within the data.

These two factors seem to be the primary obstacles that have prevented deep learning from making significant advancements in the field of NLP.

The advent of two pre-training models, Bert and GPT, marks a significant technological advancement in the field of NLP.

About a year after the introduction of Bert, the technological landscape had essentially consolidated into these two core models.

This development has had a profound impact on both academic research and industrial applications, leading to a complete transformation of the research paradigm in the field. The impact of this paradigm shift can be observed in two key areas:

  • firstly, a decline and, in some cases, the gradual obsolescence of certain NLP research subfields;
  • secondly, the growing standardization of technical methods and frameworks across different NLP subfields.

Impact 1: The Decline of Intermediate Tasks

In the field of NLP, tasks can be categorized into two major groups: “intermediate tasks” and “final tasks.”

  • Intermediate tasks, such as word segmentation, part-of-speech tagging, and syntactic analysis, don’t directly address real-world needs but rather serve as preparatory stages for solving actual tasks. For example, the user doesn’t require a syntactic analysis tree; they just want an accurate translation.
  • In contrast, “final tasks,” like text classification and machine translation, directly fulfil user needs.

Intermediate tasks initially arose due to the limited capabilities of early NLP technology. Researchers segmented complex problems like Machine Translation into simpler intermediate stages because tackling them all at once was challenging. However, the emergence of Bert/GPT has made many of these intermediate tasks obsolete. These models, through extensive pre-training on data, have incorporated these intermediate tasks as linguistic features within their parameters. As a result, we can now address final tasks directly, without modelling these intermediary processes.

Even Chinese word segmentation, a potentially controversial example, follows the same principle. We no longer need to determine which words should constitute a phrase; instead, we let Large Language Models (LLM) learn this as a feature. As long as it contributes to task-solving, LLM will naturally grasp it. This may not align with conventional human word segmentation rules.

In light of these developments, it’s evident that with the advent of Bert/GPT, NLP intermediate tasks are gradually becoming obsolete.

Impact 2: Standardization of Technical Approaches Across All Areas

Within the realm of “final tasks,” there are essentially two categories: natural language understanding tasks and natural language generation tasks.

  • Natural language understanding tasks, such as text classification and sentiment analysis, involve categorizing input text.
  • In contrast, natural language generation tasks encompass areas like chatbots, machine translation, and text summarization, where the model generates output text based on input.

Since the introduction of the Bert/GPT models, a clear trend towards technical standardization has emerged.

Firstly, feature extractors across various NLP subfields have shifted from LSTM/CNN to Transformer. The writing was on the wall shortly after Bert’s debut, and this transition became an inevitable trend.

Currently, Transformer not only unifies NLP but is also gradually supplanting other models like CNN in various image processing tasks. Multi-modal models have similarly adopted the Transformer framework. This Transformer journey, starting in NLP, is expanding into various AI domains, kickstarted by the Vision Transformer (ViT) in late 2020. This expansion shows no signs of slowing down and is likely to accelerate further.

Secondly, most NLP subfields have adopted a two-stage model: model pre-training followed by application fine-tuning or Zero/Few Shot Prompt application.

To be more specific, various NLP tasks have converged into two pre-training model frameworks:

  • For natural language understanding tasks, the “bidirectional language model pre-training + application fine-tuning” model represented by Bert has become the standard.
  • For natural language generation tasks, the “autoregressive language model (i.e., one-way language model from left to right) + Zero/Few Shot Prompt” model represented by GPT 2.0 is now the norm.

Though these models may appear similar, they are rooted in distinct development philosophies, leading to divergent future directions. Regrettably, many of us initially underestimated the potential of GPT’s development route, instead placing more focus on Bert’s model.

Paradigm Shift 2.0 (2020): Moving from Pre-Trained Models to General Artificial Intelligence (AGI)

This paradigm shift began around the time GPT 3.0 emerged, approximately in June 2020, and we are currently undergoing this transition.

ChatGPT served as a pivotal point in initiating this paradigm shift. However, before the appearance of InstructGPT, Large Language Models (LLM) were in a transitional phase.

Transition Period: Dominance of the “Autoregressive Language Model + Prompting” Model as Seen in GPT 3.0

As mentioned earlier, during the early stages of pre-training model development, the technical landscape primarily converged into two distinct paradigms: the Bert mode and the GPT mode. Bert was the favoured path, with several technical improvements aligning with that direction. However, as technology progressed, we observed that the largest LLM models currently in use are predominantly based on the “autoregressive language model + Prompting” model, similar to GPT 3.0. Models like GPT 3, PaLM, GLaM, Gopher, Chinchilla, MT-NLG, LaMDA, and more all adhere to this model, without exceptions.

Why has this become the prevailing trend? There are likely two key reasons driving this shift, and I believe they are at the forefront of this transition.

Firstly, Google’s T5 model plays a crucial role in formally uniting the external expressions of both natural language understanding and natural language generation tasks. In the T5 model, tasks that involve natural language understanding, like text classification and determining sentence similarity (marked in red and yellow in the figure above), align with generation tasks in terms of input and output format.

This means that classification tasks can be transformed within the LLM model to generate corresponding category strings, achieving a seamless integration of understanding and generation tasks. This compatibility allows natural language generation tasks to harmonize with natural language understanding tasks, a feat that would be more challenging to accomplish the other way around.
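A small sketch of this text-to-text framing follows; the prefixes are in the spirit of T5's task prefixes, and the exact strings here are illustrative:

```python
# A small sketch of the text-to-text framing described above. The prefixes are
# in the spirit of T5's task prefixes; the exact strings here are illustrative.
examples = [
    # classification: the "answer" is simply the category name as a string
    ("cola sentence: The book fell off of the table.", "acceptable"),
    # sentence similarity: even a numeric score is emitted as text
    ("stsb sentence1: A man is playing a guitar. "
     "sentence2: A person plays an instrument.", "4.2"),
    # generation tasks already fit the same text-in, text-out mold
    ("translate English to German: The house is small.", "Das Haus ist klein."),
]

for source, target in examples:
    print(f"{source!r}  ->  {target!r}")
```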

The second reason is that if you aim to excel at zero-shot prompting or few-shot prompting, the GPT mode is essential.

Now, recent studies, as referenced in “On the Role of Bidirectionality in Language Model Pre-Training,” demonstrate that when downstream tasks are resolved during fine-tuning, the Bert mode outperforms the GPT mode. Conversely, if you employ zero-shot or few-shot prompting to tackle downstream tasks, the GPT mode surpasses the Bert mode.

But this leads to an important question: Why do we strive to use zero-shot or few-shot prompting for task completion? To answer this question, we first need to address another: What type of Large Language Model (LLM) is the most ideal for our needs?

The Ideal Large Language Model (LLM)

The image above illustrates the characteristics of an ideal Large Language Model (LLM). Firstly, the LLM should possess robust self-learning capabilities. When fed with various types of data such as text and images from the world, it should autonomously acquire the knowledge contained within. This learning process should require no human intervention, and the LLM should be adept at flexibly applying this knowledge to address real-world challenges. Given the vastness of the data, this model will naturally be substantial in size, a true giant model.

Secondly, the LLM should be capable of tackling problems across any subfield of Natural Language Processing (NLP) and extend its capabilities to domains beyond NLP. Ideally, it should proficiently address queries from any discipline.

Moreover, when we utilize the LLM to resolve issues in a particular field, the LLM should understand human commands and use expressions that align with human conventions. In essence, it should adapt to humans, rather than requiring humans to adapt to the LLM model.

A common example of people adapting to LLM is the need to brainstorm and experiment with various prompts in order to find the best prompts for a specific problem. In this context, the figure above provides several examples at the interface level where humans interact with the LLM, illustrating the ideal interface design for users to effectively utilize the LLM model.

Now, let’s revisit the question: Why should we pursue zero-shot/few-shot prompting to complete tasks? There are two key reasons:

  1. The Enormous Scale of LLM Models: Building and modifying LLM models of this scale requires immense resources and expertise, and very few institutions can undertake this. However, there are numerous small and medium-sized organizations and even individuals who require the services of such models. Even if these models are open-sourced, many lack the means to deploy and fine-tune them. Therefore, an approach that allows task requesters to complete tasks without tweaking the model parameters is essential. In this context, prompt-based methods offer a solution to fulfil tasks without relying on fine-tuning (note that soft prompting deviates from this trend). LLM model creators aim to make LLM a public utility, operating it as a service. To accommodate the evolving needs of users, model producers must strive to enable LLM to perform a wide range of tasks. This objective is a byproduct and a practical reason why large models inevitably move toward achieving General Artificial Intelligence (AGI).
  2. The Evolution of Prompting Methods: Whether it’s zero-shot prompting, few-shot prompting, or the more advanced Chain of Thought (CoT) prompting that enhances LLM’s reasoning abilities, these methods align with the technology found in the interface layer illustrated earlier. The original aim of zero-shot prompting was to create the ideal interface between humans and LLM, using the task expressions that humans are familiar with. However, it was found that LLM struggled to understand and perform well with this approach. Subsequent research revealed that when a few examples were provided to represent the task description, LLM’s performance improved, leading to the exploration of better few-shot prompting technologies. In essence, our initial hope was for LLM to understand and execute tasks using natural, human-friendly commands. However, given the current technological limitations, these alternative methods have been adopted to express human task requirements.

Understanding this logic, it becomes evident that few-shot prompting, also known as In Context Learning, is a transitional technology. When we can describe a task more naturally and LLM can comprehend it, we will undoubtedly abandon these transitional methods. The reason is clear: using these approaches to articulate task requirements does not align with human habits and usage patterns.

This is also why I classify GPT 3.0+Prompting as a transitional technology. The arrival of ChatGPT has disrupted this existing state of affairs by introducing Instruct instead of Prompting. This change marks a new technological paradigm shift and has subsequently led to several significant consequences.
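To see the difference at the interface level, here is an illustrative side-by-side of the two styles (the example is mine, not drawn from either paper):

```python
# An illustrative side-by-side of the two interface styles discussed above:
# few-shot prompting / In Context Learning spells the task out with examples,
# while an Instruct-style request states it the way a person naturally would.
few_shot_prompt = """\
English: cheese -> French: fromage
English: bread -> French: pain
English: apple -> French:"""

instruct_prompt = "Translate the word 'apple' from English into French."

print("--- few-shot (In Context Learning) ---")
print(few_shot_prompt)
print()
print("--- Instruct ---")
print(instruct_prompt)
```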

Impact 1: LLM Adapting to Human Needs with Natural Interfaces

In the context of an ideal LLM, let’s focus on ChatGPT to grasp its technical significance. ChatGPT stands out as one of the technologies that align most closely with the ideal LLM, characterized by its remarkable attributes: “Powerful and considerate.”

This “powerful capability” can be primarily attributed to the foundation provided by the underlying LLM, GPT 3.5, on which ChatGPT relies. While ChatGPT includes some manually annotated data, the scale is relatively small, amounting to tens of thousands of examples. In contrast, GPT 3.5 was trained on hundreds of billions of token-level data, making this additional data negligible in terms of its contribution to the vast wealth of world knowledge and common sense already embedded in GPT 3.5. Hence, ChatGPT’s power primarily derives from the GPT 3.5 model, which sets the benchmark for the ideal LLM models.

But does ChatGPT infuse new knowledge into the GPT 3.5 model? Yes, it does, but this knowledge isn’t about facts or world knowledge; it’s about human preferences. “Human preference” encompasses a few key aspects:

  • First and foremost, it involves how humans naturally express tasks. For instance, humans typically say, “Translate the following sentence from Chinese to English” to convey the need for “machine translation.” But LLMs aren’t humans, so understanding such commands is a challenge. To bridge this gap, ChatGPT introduces this knowledge into GPT 3.5 through manual data annotation, making it easier for the LLM to comprehend human commands. This is what empowers ChatGPT with “empathy.”
  • Secondly, humans have their own standards for what constitutes a good or bad answer. For example, a detailed response is deemed good, while an answer containing discriminatory content is considered bad. The feedback data that people provide through the Reward Model embodies this quality preference (a sketch of the pairwise objective behind it follows below). In essence, ChatGPT imparts human-preference knowledge to GPT 3.5, resulting in an LLM that understands human language and is more polite.
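For reference, here is a minimal sketch of the pairwise preference objective commonly used for such a Reward Model (the InstructGPT-style loss -log sigmoid(r_chosen - r_rejected)); the scores below are made-up stand-ins for a real model's output:

```python
import math

# A minimal sketch of the pairwise preference objective behind such a Reward
# Model: loss = -log sigmoid(r_chosen - r_rejected), the InstructGPT-style
# formulation. The scores below are made-up stand-ins for a real model's output.
def preference_loss(r_chosen: float, r_rejected: float) -> float:
    # Small loss when the human-preferred answer gets the higher reward score.
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

print(preference_loss(r_chosen=2.1, r_rejected=-0.4))  # ~0.08: ranked correctly
print(preference_loss(r_chosen=-0.3, r_rejected=1.5))  # ~1.95: ranked wrongly
```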

The most significant contribution of ChatGPT is its achievement of the interface layer of the ideal LLM. It allows the LLM to adapt to how people naturally express commands, rather than requiring people to adapt to the LLM’s capabilities and devise intricate command interfaces. This shift enhances the usability and user experience of LLM.

It was InstructGPT/ChatGPT that initially recognized this challenge and offered a viable solution. This is also their most noteworthy technical contribution. In comparison to prior few-shot prompting methods, it is a human-computer interface technology that aligns better with human communication habits for interacting with LLM.

This achievement is expected to inspire subsequent LLM models and encourage further efforts in creating user-friendly human-computer interfaces, ultimately making LLM more responsive to human needs.

Impact 2: Many NLP subfields no longer have independent research value

In the realm of NLP, this paradigm shift signifies that many independently existing NLP research fields will be incorporated into the LLM technology framework, gradually losing their independent status and fading away. Following the initial paradigm shift, while numerous “intermediate tasks” in NLP are no longer required as independent research areas, most of the “final tasks” remain and have transitioned to a “pre-training + fine-tuning” framework, sparking various improvement initiatives to tackle specific domain challenges.

Current research demonstrates that for many NLP tasks, as the scale of LLM models increases, their performance significantly improves. From this, one can infer that many of the so-called “unique” challenges in a given field likely stem from a lack of domain knowledge. With sufficient domain knowledge, these seemingly field-specific issues can be effectively resolved. Thus, there’s often no need to focus intensely on field-specific problems and devise specialized solutions. The path to achieving AGI might be surprisingly straightforward: provide more data in a given field to the LLM and let it autonomously accumulate knowledge.

In this context, ChatGPT proves that we can now directly pursue the ideal LLM model. Therefore, the future technological trend should involve the pursuit of ever-larger LLM models by expanding the diversity of pre-training data, allowing LLMs to independently acquire domain-specific knowledge through pre-training. As the model scale continues to grow, numerous problems will be addressed, and the research focus will shift to constructing this ideal LLM model rather than solving field-specific problems. Consequently, more NLP subfields will be integrated into the LLM technology system and gradually phase out.

In my view, the criteria for determining whether independent research in a specific field should cease can be one of the following two methods:

  • First, assess whether the LLM’s research performance surpasses human performance for a particular task. For fields where LLM outperforms humans, there is no need for independent research. For instance, for many tasks within the GLUE and SuperGLUE test sets, LLMs currently outperform humans, rendering independently existing research fields closely associated with these datasets unnecessary.
  • Second, compare task performance between the two modes. The first mode involves fine-tuning with extensive domain-specific data, while the second mode employs few-shot prompting or instruct-based techniques. If the second mode matches or surpasses the performance of the first, it indicates that the field no longer needs to exist independently. By this standard, many research fields currently favour fine-tuning (due to the abundance of training data), seemingly justifying their independent existence. However, as models grow in size, the effectiveness of few-shot prompting continues to rise, and it’s likely that this turning point will be reached in the near future.

If these speculations hold true, it presents the following challenging realities:

  • For many NLP researchers, they must decide which path to pursue. Should they persist in addressing field-specific challenges?
  • Or should they abandon what may seem like a less promising route and instead focus on constructing a superior LLM?
  • If the choice is to invest in LLM development, which institutions possess the ability and resources to undertake this endeavour?
  • What’s your response to this question?

Impact 3: More research fields other than NLP will be included in the LLM technology system

From the perspective of AGI, and referring to the ideal LLM described earlier, the tasks it can complete should not be limited to NLP or to one or two subject areas. The ideal LLM should be a domain-independent general artificial intelligence model: the fact that it currently does well in one or two fields does not mean it can only do those tasks.

The emergence of ChatGPT proves that it is feasible for us to pursue AGI in this period, and now is the time to put aside the shackles of “field discipline” thinking.

In addition to demonstrating its ability to solve various NLP tasks in a smooth conversational format, ChatGPT also has powerful coding capabilities. Naturally, more and more other research fields will be gradually included in the LLM system and become part of general artificial intelligence.

A natural next step for LLM to expand beyond NLP is image processing and multimodal tasks. There are already efforts to integrate multimodality and make LLM a universal human-computer interface supporting multimodal input and output; typical examples include DeepMind’s Flamingo and Microsoft’s “Language Models are General-Purpose Interfaces,” whose conceptual structure is shown above.

My judgment is that, for both images and multimodality, integration into LLM as genuinely useful functionality may come more slowly than we expect.

The main reason is that, although the image field has spent the past two years imitating Bert’s pre-training approach, trying to introduce self-supervised learning so the model can independently learn knowledge from image data, with “contrastive learning” and MAE as the two typical (and distinct) technical routes,

the results so far suggest this road is not yet complete: despite great technological progress, applying pre-trained image models to downstream tasks brings far smaller gains than applying Bert or GPT to downstream NLP tasks.

Therefore, image pre-training models still need deeper exploration to unleash the potential of image data, which will delay their unification into large LLM models. Of course, if this road is eventually opened up, the NLP story will most likely repeat itself: the various subfields of image processing may gradually disappear, absorbed into large-scale LLMs that directly complete end tasks.

In addition to images and multi-modality, it is obvious that other fields will gradually be included in the ideal LLM. This direction is in the ascendant and is a high-value research topic.

The above are my personal thoughts on paradigm shift. Next, let’s sort out the mainstream technological progress of the LLM model after GPT 3.0.

As the ideal LLM model illustrates, related technologies can be divided into two major categories:

  • One category is about how the LLM model absorbs knowledge from data and also includes the impact of model size growth on LLM’s ability to absorb knowledge;
  • The second category is about human-computer interfaces about how people use the inherent capabilities of LLM to solve tasks, including In Context Learning and Instruct modes. Chain of Thought (CoT) prompting, an LLM reasoning technology, essentially belongs to In Context Learning. Because they are more important, I will talk about them separately.

Reference

  1. Vaswani, A. et al. Transformer: Attention Is All You Need. https://arxiv.org/pdf/1706.03762.pdf (2017).
  2. Radford, A., Narasimhan, K., Salimans, T. & Sutskever, I. GPT: Improving Language Understanding by Generative Pre-Training. (2018).
  3. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. (2018).
  4. Radford, A. et al. GPT2: Language Models are Unsupervised Multitask Learners. (2019).
  5. Brown, T. B. et al. GPT3: Language Models are Few-Shot Learners. (2020).
  6. Ouyang, L. et al. GPT 3.5: Training language models to follow instructions with human feedback. (2022).
  7. Eloundou, T., Manning, S., Mishkin, P. & Rock, D. GPT4: GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models. (2023).
  8. OpenAI. Introducing ChatGPT. https://openai.com/blog/chatgpt (2022).
  9. Schulman, J., Wolski, F., Dhariwal, P., Radford, A. & Klimov, O. Proximal Policy Optimization Algorithms. (2017).