ChatGPT – C. Cui's Blog

AI-Powered Search: Google’s Transformation vs. Perplexity


Abstract

This blog examines the rapidly evolving landscape of AI-powered search, comparing Google’s recent transformation with its Search Generative Experience (SGE) and Gemini integration against Perplexity AI’s native AI-first approach. Both companies now leverage large language models, but with fundamentally different architectures and philosophies.

The New Reality: Google has undergone a dramatic transformation from traditional keyword-based search to an AI-driven conversational answer engine. With the integration of Gemini, LaMDA, PaLM, and the rollout of AI Overviews (formerly SGE), Google now synthesizes information from multiple sources into concise, contextual answers—directly competing with Perplexity’s approach.

Key Findings:

Convergent Evolution: Both platforms now use LLMs for answer generation, but Google maintains its traditional search infrastructure while Perplexity was built AI-first from the ground up
Architecture Philosophy: Google integrates AI capabilities into its existing search ecosystem (hybrid approach), while Perplexity centers everything around RAG and multi-model orchestration (AI-native approach)
AI Technology Stack: Google leverages Gemini (multimodal), LaMDA (conversational), and PaLM models, while Perplexity orchestrates external models (GPT, Claude, Gemini, Llama, DeepSeek)
User Experience: Google provides AI Overviews alongside traditional search results, while Perplexity delivers answer-first experiences with citations
Market Dynamics: The competition has intensified with Google’s AI transformation, making the choice between platforms more about implementation philosophy than fundamental capabilities

This represents a paradigm shift where the question is no longer “traditional vs. AI search” but rather “how to best implement AI-powered search” with different approaches to integration, user experience, and business models.

Keywords: AI Search, RAG, Large Language Models, Search Architecture, Perplexity AI, Google Search, Conversational AI, SGE, Gemini.

Google’s AI Transformation: From PageRank to Gemini-Powered Search

Google has undergone one of the most significant transformations in its history, evolving from a traditional link-based search engine to an AI-powered answer engine. This transformation represents a strategic response to the rise of AI-first search platforms and changing user expectations.

The Search Generative Experience (SGE) Revolution

Google’s Search Generative Experience (SGE), now known as AI Overviews, fundamentally changes how search results are presented:

AI-Synthesized Answers: Instead of just providing links, Google’s AI generates comprehensive insights, explanations, and summaries from multiple sources
Contextual Understanding: Responses consider user context including location, search history, and preferences for personalized results
Multi-Step Query Handling: The system can handle complex, conversational queries that require reasoning and synthesis
Real-Time Information Grounding: AI Overviews are grounded in current, real-time information while maintaining accuracy

Google’s LLM Arsenal

Google has strategically integrated multiple advanced AI models into its search infrastructure:

Gemini: The Multimodal Powerhouse

Capabilities: Understands and generates text, images, videos, and audio
Search Integration: Enables complex query handling including visual search, reasoning tasks, and detailed information synthesis
Multimodal Processing: Handles queries that combine text, images, and other media types

LaMDA: Conversational AI Foundation

Purpose: Powers natural, dialogue-like interactions in search
Features: Enables follow-up questions and conversational context maintenance
Integration: Supports Google’s shift toward conversational search experiences

PaLM: Large-Scale Language Understanding

Role: Provides advanced language processing capabilities
Applications: Powers complex reasoning, translation (100+ languages), and contextual understanding
Scale: Handles extended documents and multimodal inputs

Technical Architecture Integration

Google’s approach differs from AI-first platforms by layering AI capabilities onto existing infrastructure:

Key Differentiators of Google’s AI Search

Hybrid Architecture: Maintains traditional search capabilities while adding AI-powered features
Scale Integration: Leverages existing massive infrastructure and data
DeepMind Synergy: Strategic integration of DeepMind research into commercial search applications
Continuous Learning: ML ranking algorithms and AI models learn from user interactions in real-time
Global Reach: AI features deployed across 100+ languages with localized understanding

Perplexity AI Architecture: The RAG-Powered Search Revolution

Perplexity AI represents a fundamental reimagining of search technology, built on three core innovations:

Retrieval-Augmented Generation (RAG): Combines real-time web crawling with large language model capabilities
Multi-Model Orchestration: Leverages multiple AI models (GPT, Claude, Gemini, Llama, DeepSeek) for optimal responses
Integrated Citation System: Provides transparent source attribution with every answer

The platform offers multiple access points to serve different user needs: Web Interface, Mobile App, Comet Browser, and Enterprise API.

Core Architecture Components

Simplified Architecture View

For executive presentations and high-level discussions, this three-layer view highlights the essential components:

How Perplexity Works: From Query to Answer

Understanding Perplexity’s workflow reveals why it delivers fundamentally different results than traditional search engines. Unlike Google’s approach of matching keywords to indexed pages, Perplexity follows a sophisticated multi-step process:

The Eight-Step Journey

Query Reception: User submits a natural language question through any interface
Real-Time Retrieval: Custom crawlers search the web for current, relevant information
Source Indexing: Retrieved content is processed and indexed in real-time
Context Assembly: RAG system compiles relevant information into coherent context
Model Selection: AI orchestrator chooses the optimal model(s) for the specific query type
Answer Generation: Selected model(s) generate comprehensive responses using retrieved context
Citation Integration: System automatically adds proper source attribution
Response Delivery: Final answer with citations is presented to the user
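
The eight steps above can be sketched as a toy pipeline. This is only an illustration under loud assumptions, not Perplexity’s implementation: the in-memory `CORPUS`, the word-overlap scoring in `retrieve`, the routing rule in `select_model`, the model names, and the `generate` stub are all hypothetical stand-ins for a real crawler, index, orchestrator, and LLM call.

```python
from dataclasses import dataclass


@dataclass
class Document:
    url: str
    text: str


# Hypothetical stand-in for pages a crawler might return (step 2).
CORPUS = [
    Document("https://example.com/rag",
             "Retrieval augmented generation combines search results with a language model."),
    Document("https://example.com/pagerank",
             "PageRank ranks web pages by analysing the links between them."),
    Document("https://example.com/cooking",
             "Slow roasting vegetables brings out their natural sweetness."),
]


def tokenize(text):
    """Lowercase the text and split it into a set of words."""
    return {w.strip(".,?!").lower() for w in text.split()}


def retrieve(query, corpus, top_k=2):
    """Steps 2-3: score documents by word overlap with the query."""
    q = tokenize(query)
    scored = [(len(q & tokenize(d.text)), d) for d in corpus]
    scored = [(s, d) for s, d in scored if s > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:top_k]]


def assemble_context(docs):
    """Step 4: number each source so the answer can cite it."""
    return "\n".join(f"[{i}] {d.text}" for i, d in enumerate(docs, start=1))


def select_model(query):
    """Step 5: toy routing rule; the model names are placeholders."""
    return "reasoning-model" if len(query.split()) > 12 else "fast-model"


def generate(model, query, context):
    """Step 6: placeholder for an LLM call; it only echoes the context."""
    return f"({model}) Based on the sources:\n{context}"


def answer(query, corpus=CORPUS):
    """Steps 1-8 end to end: returns an answer followed by its citations."""
    docs = retrieve(query, corpus)
    context = assemble_context(docs)
    model = select_model(query)
    body = generate(model, query, context)
    citations = [f"[{i}] {d.url}" for i, d in enumerate(docs, start=1)]  # step 7
    return body + "\nSources:\n" + "\n".join(citations)  # step 8


print(answer("How does retrieval augmented generation work?"))
```

The point of the sketch is the ordering: retrieval happens before generation, and the citation numbers are fixed when the context is assembled, so every numbered source in the answer maps back to a URL.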

Technical Workflow Diagram

The sequence below shows how a user query flows through Perplexity’s system.

This process typically completes in under 3 seconds, delivering both speed and accuracy.

The New Search Paradigm: AI-First vs AI-Enhanced Approaches

The competition between Google and Perplexity has evolved beyond traditional vs. AI search to represent two distinct philosophies for implementing AI-powered search experiences.

Google’s Philosophy: “AI-Enhanced Universal Search”

Hybrid Integration: Layer advanced AI capabilities onto proven search infrastructure
Comprehensive Coverage: Maintain traditional search results alongside AI-generated overviews
Gradual Transformation: Evolve existing user behaviors rather than replace them entirely
Scale Advantage: Leverage massive existing data and infrastructure for AI training and deployment

Perplexity’s Philosophy: “AI-Native Conversational Search”

Model Agnostic: Orchestrate best-in-class models rather than developing proprietary AI
Clean Slate Design: Built from the ground up with AI-first architecture
Answer-Centric: Focus entirely on direct answer generation with source attribution
Conversational Flow: Design for multi-turn, contextual conversations rather than single queries

Comprehensive Technology & Business Comparison

Dimension	Google AI-Enhanced Search	Perplexity AI-Native Search
Input	Natural language + traditional keywords	Pure natural language, conversational
AI Models	Gemini, LaMDA, PaLM (proprietary)	GPT, Claude, Gemini, Llama, DeepSeek (orchestrated)
Architecture	Hybrid (AI + traditional infrastructure)	Pure AI-first (RAG-centered)
Retrieval	Enhanced index + Knowledge Graph + real-time	Custom crawler + real-time retrieval
Core Tech	AI Overviews + traditional ranking	RAG + multi-model orchestration
Output	Hybrid (AI Overview + links + ads)	Direct answers with citations
Context	Limited conversational memory	Full multi-turn conversation memory
Extensions	Maps, News, Shopping, Ads integration	Document search, e-commerce, APIs
Business	Ad-driven + AI premium features	Subscription + API + e-commerce
UX	“AI answers + traditional options”	“Conversational AI assistant”
Products	Google Search with SGE/AI Overview	Perplexity Web/App, Comet Browser
Deployment	Global rollout with localization	Global expansion, English-focused
Data Advantage	Massive proprietary data + real-time	Real-time web data + model diversity

The Future of AI-Powered Search: A New Competitive Landscape

The integration of AI into search has fundamentally changed the competitive landscape. Rather than a battle between traditional and AI search, we now see different approaches to implementing AI-powered experiences competing for user mindshare and market position.

Implementation Strategy Battle: Integration vs. Innovation

Google’s Integration Strategy:

Advantage: Massive user base and infrastructure to deploy AI features at scale
Challenge: Balancing AI innovation with existing business model dependencies
Approach: Gradual rollout of AI features while maintaining traditional search options

Perplexity’s Innovation Strategy:

Advantage: Clean slate design optimized for AI-first experiences
Challenge: Building user base and competing with established platforms
Approach: Focus on superior AI experience to drive user acquisition

Both platforms are moving toward comprehensive multi-modal experiences:

Visual Search Integration: Google Lens vs. Perplexity’s image understanding capabilities
Voice-First Interactions: Google Assistant integration vs. conversational AI interfaces
Video and Audio Processing: Gemini’s multimodal capabilities vs. orchestrated model approaches
Document Intelligence: Enterprise document search and analysis capabilities

Business Model Evolution Under AI

Advertising Model Transformation:

Google must adapt its ad-centric model to AI Overviews without disrupting user experience
Challenge of monetizing direct answers vs. traditional click-through advertising
Need for new ad formats that work with conversational AI

Subscription and API Models:

Perplexity’s success with subscription tiers validates alternative monetization
Growing enterprise demand for AI-powered search APIs and integrations
Premium features becoming differentiators (document search, advanced models, higher usage limits)

Technical Architecture Convergence

Despite different starting points, both platforms are converging on similar technical capabilities:

Real-Time Information: Both now emphasize current, up-to-date information retrieval
Source Attribution: Transparency and citation becoming standard expectations
Conversational Context: Multi-turn conversation support across platforms
Model Diversity: Google developing multiple specialized models, Perplexity orchestrating external models

The Browser and Distribution Channel Wars

Perplexity’s Chrome Acquisition Strategy:

$34.5B all-cash bid for Chrome represents unprecedented ambition in AI search competition
Strategic Value: Control over browser defaults, user data, and search distribution
Market Impact: Success would fundamentally alter competitive dynamics and user acquisition costs
Regulatory Reality: Bid likely serves as strategic positioning and leverage rather than realistic acquisition

Alternative Distribution Strategies:

AI-native browsers (Comet) as specialized entry points
API integrations into enterprise and developer workflows
Mobile-first experiences capturing younger user demographics

Strategic Implications and Future Outlook

The competition between Google’s AI-enhanced approach and Perplexity’s AI-native strategy represents a fascinating case study in how established platforms and startups approach technological transformation differently.

Key Strategic Insights

The AI Integration Challenge: Google’s transformation demonstrates that even dominant platforms must fundamentally reimagine their core products to stay competitive in the AI era
Architecture Philosophy Matters: The choice between hybrid integration (Google) vs. AI-first design (Perplexity) creates different strengths, limitations, and user experiences
Business Model Pressure: AI-powered search challenges traditional advertising models, forcing experimentation with subscriptions, APIs, and premium features
User Behavior Evolution: Both platforms are driving the shift from “search and browse” to “ask and receive” interactions, fundamentally changing how users access information

The New Competitive Dynamics

Advantages of Google’s AI-Enhanced Approach:

Massive scale and infrastructure for global AI deployment
Existing user base to gradually transition to AI features
Deep integration with knowledge graphs and proprietary data
Ability to maintain traditional search alongside AI innovations

Advantages of Perplexity’s AI-Native Approach:

Optimized user experience designed specifically for conversational AI
Agility to implement cutting-edge AI techniques without legacy constraints
Model-agnostic architecture leveraging best-in-class external AI models
Clear value proposition for users seeking direct, cited answers

Looking Ahead: Industry Predictions

Near-Term (1-2 years):

Continued convergence of features between platforms
Google’s global rollout of AI Overviews across all markets and languages
Perplexity’s expansion into enterprise and specialized vertical markets
Emergence of more AI-native search platforms following Perplexity’s model

Medium-Term (3-5 years):

AI-powered search becomes the standard expectation across all platforms
Specialized AI search tools for professional domains (legal, medical, scientific research)
Integration of real-time multimodal capabilities (live video analysis, augmented reality search)
New regulatory frameworks for AI-powered information systems

Long-Term (5+ years):

Fully conversational AI assistants replace traditional search interfaces
Personal AI agents that understand individual context and preferences
Integration with IoT and ambient computing for seamless information access
Potential emergence of decentralized, blockchain-based search alternatives

Recommendations for Stakeholders

For Technology Leaders:

Hybrid Strategy: Consider Google’s approach of enhancing existing systems with AI rather than complete rebuilds
Model Orchestration: Investigate Perplexity’s approach of orchestrating multiple AI models for optimal results
Real-Time Capabilities: Invest in real-time information retrieval and processing systems
Citation Systems: Implement transparent source attribution to build user trust

For Business Strategists:

Revenue Model Innovation: Experiment with subscription, API, and premium feature models beyond traditional advertising
User Experience Focus: Prioritize conversational, answer-first experiences in product development
Distribution Strategy: Evaluate the importance of browser control and default search positions
Competitive Positioning: Decide between AI-enhancement of existing products vs. AI-native alternatives

For Investors:

Platform Risk Assessment: Evaluate how established platforms are adapting to AI disruption
Technology Differentiation: Assess the sustainability of competitive advantages in rapidly evolving AI landscape
Business Model Viability: Monitor the success of alternative monetization strategies beyond advertising
Regulatory Impact: Consider potential regulatory responses to AI-powered information systems and search market concentration

The future of search will be determined by execution quality, user adoption, and the ability to balance innovation with practical business considerations. Both Google and Perplexity have established viable but different paths forward, setting the stage for continued innovation and competition in the AI-powered search landscape.


Conclusion

The evolution of search from Google’s traditional PageRank-driven approach to today’s AI-powered landscape represents one of the most significant technological shifts in internet history. Google’s recent transformation with its Search Generative Experience and Gemini integration demonstrates that even the most successful platforms must reinvent themselves to remain competitive in the AI era.

The competition between Google’s AI-enhanced strategy and Perplexity’s AI-native approach offers valuable insights into different paths for implementing AI at scale. Google’s hybrid approach leverages massive existing infrastructure while gradually transforming user experiences, while Perplexity’s clean-slate design optimizes entirely for conversational AI interactions.

As both platforms continue to evolve, the ultimate winners will be users who gain access to more intelligent, efficient, and helpful ways to access information. The future of search will likely feature elements of both approaches: the scale and comprehensiveness of Google’s enhanced platform combined with the conversational fluency and transparency of AI-native solutions.

The battle for search supremacy in the AI era has only just begun, and the innovations emerging from this competition will shape how humanity accesses and interacts with information for decades to come.

This analysis reflects the state of AI-powered search as of August 2025. The rapidly evolving nature of AI technology and competitive dynamics may significantly impact future developments. Both Google and Perplexity continue to innovate at an unprecedented pace, making ongoing monitoring essential for stakeholders in this space.

Technical Review 03: Scale Effects & What happens when LLMs get bigger and bigger

AI Assistant Summary

This blog discusses the scale of Large Language Models (LLMs) and their impact on performance. LLMs like GPT, LaMDA, and PaLM have billions of parameters, raising questions about the consequences of their continued growth.

The journey of an LLM involves two stages: pre-training and scenario application. Pre-training focuses on optimizing the model using cross-entropy, while scenario application evaluates the model’s performance in specific use cases. Evaluating an LLM’s quality requires considering both stages, rather than relying solely on pre-training indicators.

Increasing training data, model parameters, and training time has been found to enhance performance in the pre-training stage. OpenAI and DeepMind have explored this issue, with OpenAI finding that a combination of more data and parameters, along with fewer training steps, produces the best results. DeepMind considers the amount of training data and model parameters equally important.

The influence of model size on downstream tasks varies. Linear tasks show consistent improvement as the model scales, while breakthrough tasks only benefit from larger models once they reach a critical scale. Tasks involving logical reasoning demonstrate sudden improvement at specific model scales. Some tasks exhibit U-shaped growth, where performance initially declines but then improves with larger models.

Reducing the LLM’s parameters while increasing training data proportionally can decrease the model’s size without sacrificing performance, leading to faster inference speed.

Understanding the impact of model size on both pre-training and downstream tasks is vital for optimizing LLM performance and exploring the potential of these language models.

Introduction

In recent years, we’ve witnessed a surge in the size of Large Language Models (LLMs), with models now boasting over 100 billion parameters becoming the new standard. Think OpenAI’s GPT-3 (175B), Google’s LaMDA (137B), PaLM (540B), and other global heavyweights. China, too, contributes to this landscape with models like Zhiyuan GLM, Huawei’s “Pangu,” Baidu’s “Wenxin,” etc. But here’s the big question: What unfolds as these LLMs continue to grow?

The journey of pre-trained models involves two crucial stages: pre-training and scenario application.

In the pre-training stage, the optimization goal is cross-entropy. For autoregressive language models such as GPT, this measures how well the model predicts the next word.
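
To make the pre-training objective concrete, here is a minimal sketch of next-token cross-entropy: the average negative log-probability the model assigns to the word that actually comes next. The 4-token vocabulary and the probability rows are made-up values for illustration only.

```python
import math


def next_token_cross_entropy(predicted_probs, target_ids):
    """Average negative log-probability assigned to the true next tokens."""
    losses = [-math.log(probs[t]) for probs, t in zip(predicted_probs, target_ids)]
    return sum(losses) / len(losses)


# Hypothetical 4-token vocabulary; each row is the model's predicted
# distribution over the next token at one position.
confident = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.8, 0.05, 0.05]]
uniform = [[0.25] * 4, [0.25] * 4]
targets = [0, 1]  # index of the token that actually came next

print(next_token_cross_entropy(confident, targets))  # lower loss
print(next_token_cross_entropy(uniform, targets))    # equals log(4)
```

A model that guesses uniformly over the vocabulary gets a loss of log(vocabulary size); the loss falls as the model puts more probability on the correct next word.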

However, the real test comes in the scenario application stage, where specific use cases dictate evaluation criteria. Generally, our intuition is that if the LLM has better indicators in the pre-training stage, its ability to solve downstream tasks will naturally be stronger. However, this is not entirely true.

Existing research shows that the optimization metric in the pre-training stage is positively correlated with downstream task performance, but the correlation is not perfect. In other words, pre-training metrics alone are not enough to judge whether an LLM is good enough. For this reason, we will look at the two stages separately to see what happens as the model grows.

Part One: pre-training phase

First, let’s look at what happens as the model size gradually increases during the pre-training stage. OpenAI specifically studied this issue in “Scaling Laws for Neural Language Models” and proposed the “scaling law” followed by the LLM model.

Source: Scaling Laws for Neural Language Models

As shown in the figure above, this study shows that when we independently increase (1) the amount of training data, (2) the number of model parameters, or (3) the training time (for example, from 1 epoch to 2 epochs), the loss of the pre-trained model on the test set decreases monotonically. In other words, the model steadily improves.
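
The monotonic decrease can be sketched with the power-law form used in “Scaling Laws for Neural Language Models”. The constants below are approximate values as reported in that paper and should be treated as illustrative; each formula also assumes the other factors are not the bottleneck.

```python
# Approximate constants from the paper; treat them as illustrative.
ALPHA_N, N_C = 0.076, 8.8e13  # exponent and scale for (non-embedding) parameters
ALPHA_D, D_C = 0.095, 5.4e13  # exponent and scale for training tokens


def loss_from_params(n_params):
    """Test loss as a power law in model size, with data not limiting."""
    return (N_C / n_params) ** ALPHA_N


def loss_from_data(n_tokens):
    """Test loss as a power law in dataset size, with model size not limiting."""
    return (D_C / n_tokens) ** ALPHA_D


for n in (1e8, 1e9, 1e10, 1e11):
    print(f"{n:.0e} parameters -> loss {loss_from_params(n):.3f}")
```

Because the exponents are small, each 10x increase in parameters or data buys only a modest, but steady, reduction in loss.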

Since all three factors are important when we actually do pre-training, we have a decision-making problem on how to allocate computing power:

Question: Suppose the total compute budget for training an LLM is fixed (for example, a set number of GPU hours or GPU days). How should that compute be allocated?

Should we increase the amount of data and reduce model parameters?

Or should we increase the amount of data and model size at the same time but reduce the number of training steps?

OpenAI

This is a zero-sum game: when one factor is scaled up, the others must be scaled down to keep the total compute unchanged, so there are many possible allocation plans.

In the end, OpenAI chose to increase the amount of training data and the number of model parameters at the same time, while using an early stopping strategy to reduce the number of training steps. Its study shows that increasing only one of the two, training data or model parameters, is not the best choice; it is better to increase both together in a certain proportion. Its conclusion is to prioritize increasing the model parameters, and then the amount of training data.

Assuming that the total computing power budget used to train LLM increases by 10 times, then the amount of model parameters should be increased by 5.5 times and the amount of training data should be increased by 1.8 times. At this time, the model gets the best performance.

DeepMind

A study by DeepMind (Reference: Training Compute-Optimal Large Language Models) explored this issue in more depth.

Source: Training Compute-Optimal Large Language Models

Its basic conclusions are similar to those of OpenAI. For example, it is indeed necessary to increase the amount of training data and model parameters at the same time, so that the model effect will be better.

Many large models did not take this into account during pre-training: they were trained by simply increasing the model parameters while keeping the amount of training data fixed. This approach is wrong and limits the potential of the LLM.

However, DeepMind revises the proportion between the two proposed by OpenAI and concludes that the amount of training data and the number of model parameters are equally important.

In other words, if the total compute budget for training an LLM increases by 10 times, the number of model parameters should increase by about 3.2 times (the square root of 10), and the amount of training data should also increase by about 3.2 times, to get the best model.
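
The two allocation rules can be compared with a few lines of arithmetic. This sketch assumes both rules take the form “scale each factor by the compute multiplier raised to an exponent”. The exponents 0.74 and 0.26 are back-derived here from the 5.5x and 1.8x figures quoted for OpenAI rather than taken from the paper, and 0.5 and 0.5 express DeepMind’s “equally important” conclusion.

```python
def scale_factors(compute_multiplier, param_exponent, data_exponent):
    """Return (parameter multiplier, data multiplier) for a given compute increase."""
    return compute_multiplier ** param_exponent, compute_multiplier ** data_exponent


# Assumed exponents: back-derived from the OpenAI figures quoted in the text.
openai_params, openai_data = scale_factors(10, 0.74, 0.26)
# Equal exponents: DeepMind's view that data and parameters matter equally.
deepmind_params, deepmind_data = scale_factors(10, 0.5, 0.5)

print(f"OpenAI:   parameters x{openai_params:.1f}, data x{openai_data:.1f}")
print(f"DeepMind: parameters x{deepmind_params:.2f}, data x{deepmind_data:.2f}")
```

In both cases the two multipliers multiply back to the 10x compute increase. With equal exponents, each factor grows by the square root of 10, about 3.16, which is close to the rounded figure quoted above.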

This means that increasing the amount of training data is more important than previously thought. Based on this understanding, DeepMind chose a different compute allocation when designing the Chinchilla model: compared with the Gopher model, which has 280B parameters and was trained on 300B tokens, Chinchilla increased the training data by 4 times but reduced the model parameters to one-fourth of Gopher’s, about 70B. Yet on both pre-training metrics and many downstream task metrics, Chinchilla is better than the larger Gopher.

This brings us to the following insight:

We can enlarge the training data and shrink the LLM parameters in the same proportion, greatly reducing the size of the model without reducing its performance.

A smaller model has many benefits; for example, inference is much faster in deployment. This is undoubtedly a promising development route for LLMs.
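
A quick check shows why the Gopher versus Chinchilla trade is compute-neutral. The sketch relies on the common rule of thumb that training compute is roughly 6 x parameters x tokens; that approximation is an assumption added here, and the parameter and token counts are the ones quoted in the text above.

```python
import math


def training_flops(n_params, n_tokens):
    """Rule-of-thumb training cost: about 6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens


gopher = training_flops(280e9, 300e9)              # 280B parameters, 300B tokens
chinchilla = training_flops(280e9 / 4, 300e9 * 4)  # one-fourth the parameters, 4x the data

print(f"Gopher:     {gopher:.2e} FLOPs")
print(f"Chinchilla: {chinchilla:.2e} FLOPs")
print("Same budget:", math.isclose(gopher, chinchilla))
```

Dividing the parameters by 4 and multiplying the data by 4 leaves the product, and so the training budget, unchanged, while the resulting model is four times smaller at inference time.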

Part Two: downstream tasks

The above covers the impact of model scale in the pre-training stage. When it comes to solving specific downstream tasks, different types of tasks respond differently as the model scale increases.

Source: Beyond the Imitation Game Benchmark

Specifically, there are the following three types of tasks.

(a) Tasks that achieve the highest linearity scores see model performance improve predictably with scale and typically rely on knowledge and simple textual manipulations.
(b) Tasks with high breakthroughs do not see model performance improve until the model reaches a critical scale. These tasks generally require sequential steps or logical reasoning. Around 5% of BIG-bench tasks see models achieve sudden score breakthroughs with increasing scale.
(c) Tasks that achieve the lowest (negative) linearity scores see model performance degrade with scale.

Linearity Tasks

The first type of task perfectly reflects the scaling law of the LLM model, which means that as the model scale gradually increases, the performance of the tasks gets better and better, as shown in (a) above.

Such tasks usually have the following common characteristics: they are often knowledge-intensive tasks. That is to say, if the LLM model contains more knowledge, the performance of such tasks will be better.

Many studies have shown that the larger the LLM, the higher its learning efficiency: for the same amount of training data, a larger model performs better. This shows that even when faced with the same batch of training data, a larger LLM acquires knowledge more efficiently than a smaller one.

What’s more, when the LLM parameters are increased, the amount of training data usually increases as well, which means that large models can learn more knowledge from more data. These studies explain why, in the figure above, knowledge-intensive tasks keep improving as the model size increases.

Most traditional NLP tasks are actually knowledge-intensive tasks, and many of them have improved greatly in the past few years, even surpassing human performance. This is most likely caused by the increase in the scale of LLMs rather than by a specific technical improvement.

Breakthroughs Tasks

The second type of task demonstrates that LLMs have some kind of “emergent ability”, as shown in (b) above. The so-called “emergent ability” means that when the model parameter scale is below a certain threshold, the model has essentially no ability to solve such tasks, and its performance is equivalent to randomly selecting answers. However, once the model scale exceeds the threshold, the LLM’s performance on such tasks increases suddenly.

In other words, model size is the key to unlocking new capabilities of LLMs. As the model grows larger, more and more new capabilities are gradually unlocked.

This is a remarkable phenomenon because it gives reason for optimism about the future. Many tasks that LLMs cannot solve well at present may be solved in the future if we continue to make the model larger, because an “emergent ability” may suddenly unlock them one day. The growth of LLMs may bring us unexpected and wonderful gifts.

The article “Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models” points out that tasks that exhibit “emergent abilities” also share some common features: they generally consist of multiple steps, several intermediate steps must be solved first, and logical reasoning skills play an important role in the final solution.

Chain of Thought (CoT) Prompting is a typical technology that enhances the reasoning ability of LLM, which can greatly improve the effect of such tasks. I will discuss the CoT technology in the following blogs.

Here the most important question is, why does LLM have this “emergent ability” phenomenon? The article “Emergent Abilities of Large Language Models” shares several possible explanations:

Source: Emergent Abilities of Large Language Models

One possible explanation is that the evaluation metrics of some tasks are not smooth enough. For example, some metrics for generation tasks require the string output by the model to match the reference answer exactly; otherwise, it is scored zero.

Thus, even as the model gradually improves and outputs more correct fragments, any small error still earns zero points. A score is awarded only when the model is large enough to get every output segment right. In other words, because the metric is not smooth enough, it cannot reflect the fact that the LLM is gradually improving on the task, and the improvement looks from the outside like an “emergent ability”.
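
The effect of a non-smooth metric can be seen in a small sketch. The reference string and the four outputs are made up, standing in for progressively larger models; `char_accuracy` is a hypothetical partial-credit metric introduced only for contrast with exact match.

```python
def exact_match(prediction, reference):
    """All-or-nothing metric: 1.0 only if the whole string is right."""
    return 1.0 if prediction == reference else 0.0


def char_accuracy(prediction, reference):
    """Partial-credit metric: fraction of positions with the right character."""
    matches = sum(p == r for p, r in zip(prediction, reference))
    return matches / max(len(prediction), len(reference))


reference = "12345"
# Hypothetical outputs, ordered from the smallest to the largest model.
outputs = ["10000", "12000", "12340", "12345"]

for out in outputs:
    print(out, exact_match(out, reference), char_accuracy(out, reference))
```

Under the partial-credit metric, the scores climb steadily (0.2, 0.4, 0.8, 1.0), while exact match stays at zero until the last output and then jumps to 1.0, which looks like a sudden emergence.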

Another possible explanation is that some tasks are composed of several intermediate steps. As the size of the model increases, the ability to solve each step gradually increases, but as long as one intermediate step is wrong, the final answer will be wrong. This will also lead to this superficial “emergent ability” phenomenon.
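
This multi-step explanation can be illustrated with simple arithmetic. The sketch assumes, purely for illustration, that a task has 10 steps, that each step succeeds independently with the same probability, and that the per-step accuracies listed correspond to progressively larger models.

```python
def chain_success(step_accuracy, n_steps):
    """Probability that every one of n independent steps is correct."""
    return step_accuracy ** n_steps


# Hypothetical per-step accuracies for models of increasing size.
for step_accuracy in (0.5, 0.7, 0.9, 0.99):
    final = chain_success(step_accuracy, 10)
    print(f"per-step {step_accuracy:.2f} -> full task {final:.4f}")
```

Per-step accuracy improves smoothly, yet the full-task success rate stays near zero for most of the range and rises sharply only once each step is almost always right.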

Of course, the above explanations are still conjectures at present. As for why LLM has this phenomenon, further and in-depth research is needed.

U-shaped Tasks

Source: Inverse scaling can become U-shaped

There are also a small number of tasks whose performance curve is U-shaped as the model grows: performance first gets worse as the model size increases, but once the model grows further, it starts to get better and better. The figure above shows this U-shaped trend for the PaLM model (pink curve) on two tasks.

Why do these tasks appear so special? The article “Inverse Scaling Can Become U-shaped” gives an explanation:

These tasks actually contain two different types of subtasks: one is the real task, and the other is the “interference task” (distractor task).

When the model is small, it cannot identify either subtask, so its performance is close to random guessing.
When the model grows to a medium size, it mainly tries to solve the distractor task, which hurts its performance on the real task. This shows up as a decline in the measured performance.
When the model size is further increased, the LLM can ignore the distractor task and perform the real task, so performance starts to rise.
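The three regimes above can be sketched with a toy model. This is not the paper's analysis: the logistic curves, their midpoints, and the three accuracy constants are all assumptions picked so that the model learns to follow the distractor at a smaller scale than it learns the real task.

```python
import math

CHANCE = 0.5       # accuracy when guessing at random (assumed binary choice)
DISTRACTOR = 0.1   # accuracy when the model follows the distractor (assumed)
TRUE_TASK = 0.95   # accuracy when the model solves the real task (assumed)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def toy_accuracy(log10_params: float) -> float:
    """Toy accuracy as a function of model scale (log10 of parameter count)."""
    # Assumed: the distractor is picked up earlier (midpoint 8.5)
    # than the real task (midpoint 10.5).
    p_distractor = sigmoid(2.0 * (log10_params - 8.5))
    p_true = sigmoid(2.0 * (log10_params - 10.5))
    # Without the real task, the model either follows the distractor or guesses.
    fallback = p_distractor * DISTRACTOR + (1.0 - p_distractor) * CHANCE
    return p_true * TRUE_TASK + (1.0 - p_true) * fallback


scales = range(7, 13)
accuracies = [toy_accuracy(s) for s in scales]
for s, acc in zip(scales, accuracies):
    print(f"10^{s} parameters: accuracy {acc:.2f}")
```

With these made-up constants, accuracy starts near chance, drops well below chance in the middle of the range, and then recovers at the largest scales, reproducing the U shape qualitatively.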

For tasks whose performance keeps declining as the model size increases, using Chain of Thought (CoT) Prompting changes the picture. Some of these tasks then follow the Scaling Law, that is, the larger the model, the better the performance, while others turn into a U-shaped curve.

This suggests that tasks of this type are reasoning tasks, which is why their performance changes qualitatively once CoT is added.

Personal View

Increasing the size of an LLM may not seem technically significant, but it is actually very important for building better LLMs. In my opinion, the advances from BERT to GPT-3 and ChatGPT are more likely attributable to growth in model size than to any specific technique. I believe many people want to explore the scale ceiling of LLMs if possible.

The key to achieving AGI may lie in large and diverse data, large-scale models, and rigorous training processes. Developing such large LLMs demands strong engineering skills from the technical team, so scaling up has real technical substance of its own.

Increasing the scale of LLMs also has research significance, for two main reasons.

Firstly, as the model size grows, performance on various tasks improves, especially on knowledge-intensive tasks. Additionally, for reasoning and other difficult tasks, performance with CoT Prompting follows a scaling law. Therefore, it is important to determine how far scale alone can take LLMs in solving these tasks.
Secondly, the “emergent ability” of LLM suggests that increasing the model size may unlock new capabilities that we did not expect. This raises the question of what these capabilities could be.

Considering these factors, it is necessary to continue increasing the model size to explore the limits of its ability to solve different tasks.

Talk is cheap. In reality, very few AI/ML practitioners have the opportunity or ability to build larger models, because doing so requires substantial funding, willingness to invest, engineering capability, and technical enthusiasm on the part of research institutions. There are probably no more than 10 institutions on Earth that can do this. In the future, however, capable institutions may join forces to build a super-large model:

All (Resources) for One (Model) and One (Model) for All (People).
Modified from Alexandre Dumas, The Three Musketeers

Reference

OpenAI 2020: Scaling Laws for Neural Language Models (https://arxiv.org/abs/2001.08361)
DeepMind 2022: Training Compute-Optimal Large Language Models (https://arxiv.org/abs/2203.15556)
BIG-bench Project Team 2023: Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models (https://arxiv.org/abs/2206.04615)
Google 2023: Inverse scaling can become U-shaped (https://arxiv.org/abs/2211.02011)

What’s Next?

Technical Review 04: Human-Computer Interface: From In Context Learning to Instruct Understanding (ChatGPT)

Previous Blogs