Recently, Y Combinator hosted Bob McGrew, the former Chief Research Officer at OpenAI and a veteran technologist from PayPal and Palantir. What surprised many was the line of questioning. Instead of asking him how to build the next GPT, founders kept pressing him on a very different topic: Palantir’s FDE model.
Bob admitted that over the past year, nearly every startup he’s advised has been obsessed with learning how this model works in practice.
What Exactly Is FDE?
FDE (Forward Deployed Engineer) is a model where engineers embed directly with customers to bridge the gap between what the product aspires to be and what the customer actually needs.
The idea traces back to Palantir’s early days working with U.S. intelligence agencies. The challenges were messy, complex, and had no off-the-shelf solutions. The only way forward was to “build on the ground” with the client. At the time, many dismissed it as unscalable, labor-intensive, and far from the clean SaaS ideal. Fast forward to today, and the very same approach is being embraced by AI startups building agents and enterprise solutions.
How It Works
Palantir structured its FDE teams around two roles:
Echo: the industry-savvy operator who lives inside the customer’s workflow, identifies core pain points, and challenges the status quo.
Delta: the technical builder who can spin up prototypes quickly, solving problems in real time.
Meanwhile, the core product team back at HQ takes these frontline hacks and turns them into platform features. Think of it as paving a permanent road where the FDEs first laid down gravel.
Why It Matters
The strength of the FDE model is that it forges unusually deep relationships with customers. It surfaces real market demand—things no survey or user interview could ever uncover. Done right, it creates a defensible moat.
But it’s also risky. Without discipline, FDE can collapse into traditional consulting or body-shop outsourcing. The litmus test of a healthy model is whether the core platform keeps evolving, making each new deployment faster, cheaper, and more scalable.
Different from Consulting
The distinction is critical:
Consulting delivers one-off solutions.
FDE is about feeding learnings back into the product, so the platform gets stronger with every customer.
This feedback loop—and the ability of product managers to abstract from bespoke requests—is what turns customer-specific fixes into reusable product capabilities.
Why AI Startups Love It
For AI Agent companies, the market is far too fragmented and unpredictable for a “one-size-fits-all” solution. No universal product exists. Embedding deeply with customers isn’t optional—it’s the only way to figure out what works, discover product-market fit, and build enduring platforms.
A Shift in Business Models
Unlike traditional SaaS, which scales on pure subscriptions, FDE contracts are more outcome-driven and flexible. The key lever is product leverage: doing the same amount of frontline work but translating it into larger contracts and less marginal customization over time.
The Bigger Picture
The rise of FDE highlights a paradox of modern tech: at scale, the best companies keep doing the things that “don’t scale.” The gulf between breakthrough AI capabilities and messy, real-world adoption is exactly where the biggest opportunities lie today.
It’s not an easy path—more trench warfare than blitzscaling—but for founders, it may be the only one that works.
This blog examines the rapidly evolving landscape of AI-powered search, comparing Google's recent transformation with its Search Generative Experience (SGE) and Gemini integration against Perplexity AI's native AI-first approach. Both companies now leverage large language models, but with fundamentally different architectures and philosophies.
The New Reality: Google has undergone a dramatic transformation from traditional keyword-based search to an AI-driven conversational answer engine. With the integration of Gemini, LaMDA, PaLM, and the rollout of AI Overviews (formerly SGE), Google now synthesizes information from multiple sources into concise, contextual answers—directly competing with Perplexity’s approach.
Key Findings:
Convergent Evolution: Both platforms now use LLMs for answer generation, but Google maintains its traditional search infrastructure while Perplexity was built AI-first from the ground up
Architecture Philosophy: Google integrates AI capabilities into its existing search ecosystem (hybrid approach), while Perplexity centers everything around RAG and multi-model orchestration (AI-native approach)
AI Technology Stack: Google leverages Gemini (multimodal), LaMDA (conversational), and PaLM models, while Perplexity orchestrates external models (GPT, Claude, Gemini, Llama, DeepSeek)
User Experience: Google provides AI Overviews alongside traditional search results, while Perplexity delivers answer-first experiences with citations
Market Dynamics: The competition has intensified with Google’s AI transformation, making the choice between platforms more about implementation philosophy than fundamental capabilities
This represents a paradigm shift where the question is no longer “traditional vs. AI search” but rather “how to best implement AI-powered search” with different approaches to integration, user experience, and business models.
Keywords: AI Search, RAG, Large Language Models, Search Architecture, Perplexity AI, Google Search, Conversational AI, SGE, Gemini.
Google’s AI Transformation: From PageRank to Gemini-Powered Search
Google has undergone one of the most significant transformations in its history, evolving from a traditional link-based search engine to an AI-powered answer engine. This transformation represents a strategic response to the rise of AI-first search platforms and changing user expectations.
The Search Generative Experience (SGE) Revolution
Google’s Search Generative Experience (SGE), now known as AI Overviews, fundamentally changes how search results are presented:
AI-Synthesized Answers: Instead of just providing links, Google’s AI generates comprehensive insights, explanations, and summaries from multiple sources
Contextual Understanding: Responses consider user context including location, search history, and preferences for personalized results
Multi-Step Query Handling: The system can handle complex, conversational queries that require reasoning and synthesis
Real-Time Information Grounding: AI overviews are grounded in current, real-time information while maintaining accuracy
Google’s LLM Arsenal
Google has strategically integrated multiple advanced AI models into its search infrastructure:
Gemini: The Multimodal Powerhouse
Capabilities: Understands and generates text, images, videos, and audio
Search Integration: Enables complex query handling including visual search, reasoning tasks, and detailed information synthesis
Multimodal Processing: Handles queries that combine text, images, and other media types
LaMDA: Conversational AI Foundation
Purpose: Powers natural, dialogue-like interactions in search
Features: Enables follow-up questions and conversational context maintenance
PaLM: Advanced Language Processing
Role: Provides advanced language processing capabilities
Applications: Powers complex reasoning, translation (100+ languages), and contextual understanding
Scale: Handles extended documents and multimodal inputs
Technical Architecture Integration
Google’s approach differs from AI-first platforms by layering AI capabilities onto existing infrastructure:
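To make this "layering" concrete, here is a minimal, self-contained toy sketch, not Google's actual stack: a conventional keyword-ranking layer stays unchanged, and a summarization step is added on top of its output. The tiny index and the summarize() placeholder are illustrative assumptions.

```python
# Toy sketch of a hybrid search stack: the traditional ranked-retrieval layer stays
# in place, and an AI synthesis layer is added on top of its results.
# The index and summarize() are placeholders, not any real Google API.

index = [
    {"title": "Eiffel Tower - History", "snippet": "The Eiffel Tower was completed in 1889."},
    {"title": "Paris Travel Guide", "snippet": "Paris is the capital of France."},
]

def keyword_search(query, k=10):
    # Existing layer: rank documents by simple keyword overlap
    words = set(query.lower().split())
    return sorted(index, key=lambda d: -len(words & set(d["snippet"].lower().split())))[:k]

def summarize(query, results):
    # Placeholder for an LLM call that synthesizes an overview from the top results
    return f"Overview for '{query}', synthesized from {len(results)} sources."

def hybrid_search(query):
    ranked = keyword_search(query)              # traditional search layer, unchanged
    overview = summarize(query, ranked[:3])     # new AI layer on top
    return {"ai_overview": overview, "results": ranked}

print(hybrid_search("when was the Eiffel Tower completed"))
```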
Key Differentiators of Google’s AI Search
Hybrid Architecture: Maintains traditional search capabilities while adding AI-powered features
Scale Integration: Leverages existing massive infrastructure and data
DeepMind Synergy: Strategic integration of DeepMind research into commercial search applications
Continuous Learning: ML ranking algorithms and AI models learn from user interactions in real-time
Global Reach: AI features deployed across 100+ languages with localized understanding
Perplexity AI Architecture: The RAG-Powered Search Revolution
Perplexity AI represents a fundamental reimagining of search technology, built on three core innovations:
Retrieval-Augmented Generation (RAG): Combines real-time web crawling with large language model capabilities
Multi-Model Orchestration: Leverages multiple AI models (GPT, Claude, Gemini, Llama, DeepSeek) for optimal responses
Integrated Citation System: Provides transparent source attribution with every answer
The platform offers multiple access points to serve different user needs: Web Interface, Mobile App, Comet Browser, and Enterprise API.
Core Architecture Components
Simplified Architecture View
For executive presentations and high-level discussions, this three-layer view highlights the essential components:
How Perplexity Works: From Query to Answer
Understanding Perplexity's workflow reveals why it delivers fundamentally different results than traditional search engines. Unlike Google's approach of matching keywords to indexed pages, Perplexity follows a sophisticated multi-step process (a code sketch follows the numbered list below):
The Eight-Step Journey
Query Reception: User submits a natural language question through any interface
Real-Time Retrieval: Custom crawlers search the web for current, relevant information
Source Indexing: Retrieved content is processed and indexed in real-time
Context Assembly: RAG system compiles relevant information into coherent context
Model Selection: AI orchestrator chooses the optimal model(s) for the specific query type
Answer Generation: Selected model(s) generate comprehensive responses using retrieved context
Citation Integration: System automatically adds proper source attribution
Response Delivery: Final answer with citations is presented to the user
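To make the eight steps concrete, here is a self-contained toy of the retrieve-then-generate flow. It illustrates the pattern only and is not Perplexity's actual code: a tiny in-memory corpus stands in for real-time crawling, and generate() is a placeholder for whichever model the orchestrator would select.

```python
# Illustrative, runnable toy of a RAG-style answer pipeline (steps 1-8 above).

corpus = [
    {"url": "https://example.com/a", "text": "Perplexity AI answers questions with cited sources."},
    {"url": "https://example.com/b", "text": "RAG combines retrieval with language model generation."},
    {"url": "https://example.com/c", "text": "Bananas are rich in potassium."},
]

def retrieve(question, k=2):
    # Steps 2-4: score documents by word overlap and keep the top-k as context
    q_words = set(question.lower().split())
    scored = sorted(corpus, key=lambda d: -len(q_words & set(d["text"].lower().split())))
    return scored[:k]

def generate(question, sources):
    # Steps 5-6: a real system would prompt the selected LLM with the sources;
    # here we echo the best passage so the toy stays runnable
    return f"{sources[0]['text']} [1]"

def answer(question):
    sources = retrieve(question)                    # step 1: query comes in
    text = generate(question, sources)
    citations = {f"[{i+1}]": s["url"] for i, s in enumerate(sources)}  # steps 7-8
    return text, citations

print(answer("How does RAG work in Perplexity?"))
```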
Technical Workflow Diagram
The sequence below shows how a user query flows through Perplexity’s system.
This process typically completes in under 3 seconds, delivering both speed and accuracy.
The New Search Paradigm: AI-First vs AI-Enhanced Approaches
The competition between Google and Perplexity has evolved beyond traditional vs. AI search to represent two distinct philosophies for implementing AI-powered search experiences.
The Future of AI-Powered Search: A New Competitive Landscape
The integration of AI into search has fundamentally changed the competitive landscape. Rather than a battle between traditional and AI search, we now see different approaches to implementing AI-powered experiences competing for user mindshare and market position.
Implementation Strategy Battle: Integration vs. Innovation
Google’s Integration Strategy:
Advantage: Massive user base and infrastructure to deploy AI features at scale
Challenge: Balancing AI innovation with existing business model dependencies
Approach: Gradual rollout of AI features while maintaining traditional search options
Perplexity’s Innovation Strategy:
Advantage: Clean slate design optimized for AI-first experiences
Challenge: Building user base and competing with established platforms
Approach: Focus on superior AI experience to drive user acquisition
The Multi-Modal Future
Both platforms are moving toward comprehensive multi-modal experiences:
Visual Search Integration: Google Lens vs. Perplexity’s image understanding capabilities
Voice-First Interactions: Google Assistant integration vs. conversational AI interfaces
Video and Audio Processing: Gemini’s multimodal capabilities vs. orchestrated model approaches
Document Intelligence: Enterprise document search and analysis capabilities
Business Model Evolution Under AI
Advertising Model Transformation:
Google must adapt its ad-centric model to AI Overviews without disrupting user experience
Challenge of monetizing direct answers vs. traditional click-through advertising
Need for new ad formats that work with conversational AI
Subscription and API Models:
Perplexity’s success with subscription tiers validates alternative monetization
Growing enterprise demand for AI-powered search APIs and integrations
Convergence of Capabilities
Despite different starting points, both platforms are converging on similar technical capabilities:
Real-Time Information: Both now emphasize current, up-to-date information retrieval
Source Attribution: Transparency and citation becoming standard expectations
Conversational Context: Multi-turn conversation support across platforms
Model Diversity: Google developing multiple specialized models, Perplexity orchestrating external models
The Browser and Distribution Channel Wars
Perplexity’s Chrome Acquisition Strategy:
$34.5B all-cash bid for Chrome represents unprecedented ambition in AI search competition
Strategic Value: Control over browser defaults, user data, and search distribution
Market Impact: Success would fundamentally alter competitive dynamics and user acquisition costs
Regulatory Reality: Bid likely serves as strategic positioning and leverage rather than realistic acquisition
Alternative Distribution Strategies:
AI-native browsers (Comet) as specialized entry points
API integrations into enterprise and developer workflows
Mobile-first experiences capturing younger user demographics
Strategic Implications and Future Outlook
The competition between Google’s AI-enhanced approach and Perplexity’s AI-native strategy represents a fascinating case study in how established platforms and startups approach technological transformation differently.
Key Strategic Insights
The AI Integration Challenge: Google’s transformation demonstrates that even dominant platforms must fundamentally reimagine their core products to stay competitive in the AI era
Architecture Philosophy Matters: The choice between hybrid integration (Google) vs. AI-first design (Perplexity) creates different strengths, limitations, and user experiences
Business Model Pressure: AI-powered search challenges traditional advertising models, forcing experimentation with subscriptions, APIs, and premium features
User Behavior Evolution: Both platforms are driving the shift from “search and browse” to “ask and receive” interactions, fundamentally changing how users access information
The New Competitive Dynamics
Advantages of Google’s AI-Enhanced Approach:
Massive scale and infrastructure for global AI deployment
Existing user base to gradually transition to AI features
Deep integration with knowledge graphs and proprietary data
Ability to maintain traditional search alongside AI innovations
Advantages of Perplexity’s AI-Native Approach:
Optimized user experience designed specifically for conversational AI
Agility to implement cutting-edge AI techniques without legacy constraints
Model-agnostic architecture leveraging best-in-class external AI models
Clear value proposition for users seeking direct, cited answers
Looking Ahead: Industry Predictions
Near-Term (1-2 years):
Continued convergence of features between platforms
Google’s global rollout of AI Overviews across all markets and languages
Perplexity’s expansion into enterprise and specialized vertical markets
Emergence of more AI-native search platforms following Perplexity’s model
Medium-Term (3-5 years):
AI-powered search becomes the standard expectation across all platforms
Specialized AI search tools for professional domains (legal, medical, scientific research)
Integration of real-time multimodal capabilities (live video analysis, augmented reality search)
New regulatory frameworks for AI-powered information systems
Long-Term (5+ years):
Fully conversational AI assistants replace traditional search interfaces
Personal AI agents that understand individual context and preferences
Integration with IoT and ambient computing for seamless information access
Potential emergence of decentralized, blockchain-based search alternatives
Recommendations for Stakeholders
For Technology Leaders:
Hybrid Strategy: Consider Google’s approach of enhancing existing systems with AI rather than complete rebuilds
Model Orchestration: Investigate Perplexity’s approach of orchestrating multiple AI models for optimal results
Real-Time Capabilities: Invest in real-time information retrieval and processing systems
Citation Systems: Implement transparent source attribution to build user trust
For Business Strategists:
Revenue Model Innovation: Experiment with subscription, API, and premium feature models beyond traditional advertising
User Experience Focus: Prioritize conversational, answer-first experiences in product development
Distribution Strategy: Evaluate the importance of browser control and default search positions
Competitive Positioning: Decide between AI-enhancement of existing products vs. AI-native alternatives
For Investors:
Platform Risk Assessment: Evaluate how established platforms are adapting to AI disruption
Technology Differentiation: Assess the sustainability of competitive advantages in rapidly evolving AI landscape
Business Model Viability: Monitor the success of alternative monetization strategies beyond advertising
Regulatory Impact: Consider potential regulatory responses to AI-powered information systems and search market concentration
Distribution Dynamics: Monitor the browser control battle and distribution channel acquisitions
The future of search will be determined by execution quality, user adoption, and the ability to balance innovation with practical business considerations. Both Google and Perplexity have established viable but different paths forward, setting the stage for continued innovation and competition in the AI-powered search landscape.
Conclusion
The evolution of search from Google’s traditional PageRank-driven approach to today’s AI-powered landscape represents one of the most significant technological shifts in internet history. Google’s recent transformation with its Search Generative Experience and Gemini integration demonstrates that even the most successful platforms must reinvent themselves to remain competitive in the AI era.
The competition between Google’s AI-enhanced strategy and Perplexity’s AI-native approach offers valuable insights into different paths for implementing AI at scale. Google’s hybrid approach leverages massive existing infrastructure while gradually transforming user experiences, while Perplexity’s clean-slate design optimizes entirely for conversational AI interactions.
As both platforms continue to evolve, the ultimate winners will be users who gain access to more intelligent, efficient, and helpful ways to access information. The future of search will likely feature elements of both approaches: the scale and comprehensiveness of Google’s enhanced platform combined with the conversational fluency and transparency of AI-native solutions.
The battle for search supremacy in the AI era has only just begun, and the innovations emerging from this competition will shape how humanity accesses and interacts with information for decades to come.
This analysis reflects the state of AI-powered search as of August 2025. The rapidly evolving nature of AI technology and competitive dynamics may significantly impact future developments. Both Google and Perplexity continue to innovate at an unprecedented pace, making ongoing monitoring essential for stakeholders in this space.
“On the Sincerity and Mastery in Large Models” is a two-part essay inspired by Sun Simiao’s classical Chinese text On the Absolute Sincerity of Great Physicians. Written in classical Chinese style, it warns against superficial understanding and blind faith in large language models (LLMs). It calls for practitioners to uphold a spirit of diligence (“精”) and sincerity (“诚”)—to understand the inner principles of algorithms and the biases within data. The model is but a tool; its moral compass lies in the human operator. Only by combining technical rigor with ethical restraint can AI serve humanity and avoid causing harm. This is both a philosophical treatise on AI and a critique of today’s hasty tech culture.
Abstract: this article presents three distinct, cross-disciplinary strategies for ensuring the safe and beneficial development of Artificial Intelligence.
Addressing Idiosyncratic AI Error Patterns (Cybersecurity Perspective): Bruce Schneier and Nathan E. Sanders highlight that AI systems, particularly Large Language Models (LLMs), exhibit error patterns significantly different from human mistakes—being less predictable, not clustered around knowledge gaps, and lacking self-awareness of error. They propose a dual research thrust: engineering AIs to produce more human-intelligible errors (e.g., through refined alignment techniques like RLHF) and developing novel security and mistake-correction systems specifically designed for AI’s unique “weirdness” (e.g., iterative, varied prompting).
Updating Ethical Frameworks to Combat AI Deception (Robotics & Internet Culture Perspective): Dariusz Jemielniak argues that Isaac Asimov's traditional Three Laws of Robotics are insufficient for modern AI due to the rise of AI-enabled deception, including deepfakes, sophisticated misinformation campaigns, and manipulative AI interactions. He proposes a "Fourth Law of Robotics": A robot or AI must not deceive a human being by impersonating a human being. Implementing this law would necessitate mandatory AI disclosure, clear labeling of AI-generated content, technical identification standards, legal enforcement, and public AI literacy initiatives to maintain trust in human-AI collaboration.
Establishing Rigorous Protocols for AGI Detection and Interaction (Astrobiology/SETI Perspective): Edmon Begoli and Amir Sadovnik suggest that research into Artificial General Intelligence (AGI) can draw methodological lessons from the Search for Extraterrestrial Intelligence (SETI). They advocate for a structured scientific approach to AGI that includes:
Developing clear, multidisciplinary definitions of “general intelligence” and related concepts like consciousness.
Creating robust, novel metrics and evaluation benchmarks for detecting AGI, moving beyond limitations of tests like the Turing Test.
Formulating internationally recognized post-detection protocols for validation, transparency, safety, and ethical considerations, should AGI emerge.
Collectively, these perspectives emphasize the urgent need for innovative, multi-faceted approaches—spanning security engineering, ethical guideline revision, and rigorous scientific protocol development—to proactively manage the societal integration and potential future trajectory of advanced AI systems.
Here are the full detailed content:
3 Ways to Keep AI on Our Side
AS ARTIFICIAL INTELLIGENCE reshapes society, our traditional safety nets and ethical frameworks are being put to the test. How can we make sure that AI remains a force for good? Here we bring you three fresh visions for safer AI.
In the first essay, security expert Bruce Schneier and data scientist Nathan E. Sanders explore how AI’s “weird” error patterns create a need for innovative security measures that go beyond methods honed on human mistakes.
Dariusz Jemielniak, an authority on Internet culture and technology, argues that the classic robot ethics embodied in Isaac Asimov’s famous rules of robotics need an update to counterbalance AI deception and a world of deepfakes.
And in the final essay, the AI researchers Edmon Begoli and Amir Sadovnik suggest taking a page from the search for intelligent life in the stars; they propose rigorous standards for detecting the possible emergence of human-level AI intelligence.
As AI advances with breakneck speed, these cross-disciplinary strategies may help us keep our hands on the reins.
AI Mistakes Are Very Different from Human Mistakes
WE NEED NEW SECURITY SYSTEMS DESIGNED TO DEAL WITH THEIR WEIRDNESS
Bruce Schneier & Nathan E. Sanders
HUMANS MAKE MISTAKES all the time. All of us do, every day, in tasks both new and routine. Some of our mistakes are minor, and some are catastrophic. Mistakes can break trust with our friends, lose the confidence of our bosses, and sometimes be the difference between life and death.
Over the millennia, we have created security systems to deal with the sorts of mistakes humans commonly make. These days, casinos rotate their dealers regularly, because they make mistakes if they do the same task for too long. Hospital personnel write on patients’ limbs before surgery so that doctors operate on the correct body part, and they count surgical instruments to make sure none are left inside the body. From copyediting to double-entry bookkeeping to appellate courts, we humans have gotten really good at preventing and correcting human mistakes.
Humanity is now rapidly integrating a wholly different kind of mistake-maker into society: AI. Technologies like large language models (LLMs) can perform many cognitive tasks traditionally fulfilled by humans, but they make plenty of mistakes. You may have heard about chatbots telling people to eat rocks or add glue to pizza. What differentiates AI systems' mistakes from human mistakes is their weirdness. That is, AI systems do not make mistakes in the same ways that humans do.
Much of the risk associated with our use of AI arises from that difference. We need to invent new security systems that adapt to these differences and prevent harm from AI mistakes.
IT’S FAIRLY EASY to guess when and where humans will make mistakes. Human errors tend to come at the edges of someone’s knowledge: Most of us would make mistakes solving calculus problems. We expect human mistakes to be clustered: A single calculus mistake is likely to be accompanied by others. We expect mistakes to wax and wane depending on factors such as fatigue and distraction. And mistakes are typically accompanied by ignorance: Someone who makes calculus mistakes is also likely to respond “I don’t know” to calculus-related questions.
To the extent that AI systems make these humanlike mistakes, we can bring all of our mistake-correcting systems to bear on their output. But the current crop of AI models—particularly LLMs—make mistakes differently.
AI errors come at seemingly random times, without any clustering around particular topics. The mistakes tend to be more evenly distributed through the knowledge space; an LLM might be equally likely to make a mistake on a calculus question as it is to propose that cabbages eat goats. And AI mistakes aren’t accompanied by ignorance. An LLM will be just as confident when saying something completely and obviously wrong as it will be when saying something true.
The inconsistency of LLMs makes it hard to trust their reasoning in complex, multistep problems. If you want to use an AI model to help with a business problem, it’s not enough to check that it understands what factors make a product profitable; you need to be sure it won’t forget what money is.
THIS SITUATION INDICATES two possible areas of research: engineering LLMs to make mistakes that are more humanlike, and building new mistake-correcting systems that deal with the specific sorts of mistakes that LLMs tend to make.
We already have some tools to lead LLMs to act more like humans. Many of these arise from the field of “alignment” research, which aims to make models act in accordance with the goals of their human developers. One example is the technique that was arguably responsible for the breakthrough success of ChatGPT: reinforcement learning with human feedback. In this method, an AI model is rewarded for producing responses that get a thumbs-up from human evaluators. Similar approaches could be used to induce AI systems to make humanlike mistakes, particularly by penalizing them more for mistakes that are less intelligible.
When it comes to catching AI mistakes, some of the systems that we use to prevent human mistakes will help. To an extent, forcing LLMs to double-check their own work can help prevent errors. But LLMs can also confabulate seemingly plausible yet truly ridiculous explanations for their flights from reason.
Other mistake-mitigation systems for AI are unlike anything we use for humans. Because machines can’t get fatigued or frustrated, it can help to ask an LLM the same question repeatedly in slightly different ways and then synthesize its responses. Humans won’t put up with that kind of annoying repetition, but machines will.
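As a toy illustration of that idea, the sketch below asks reworded versions of one question and keeps the majority answer; ask_llm() is a stub standing in for a real model call, and the simulated error is contrived for the example.

```python
# Toy illustration of "ask the same question in several different ways":
# collect answers to reworded prompts and keep the most common one.
from collections import Counter

def ask_llm(prompt: str) -> str:
    # Stand-in for a real LLM call; errs on one phrasing to show the effect
    return "Lyon" if "Name" in prompt else "Paris"

def synthesized_answer(prompts):
    answers = [ask_llm(p) for p in prompts]
    return Counter(answers).most_common(1)[0][0]  # majority vote filters the outlier

variants = [
    "What is the capital of France?",
    "Which city is France's capital?",
    "Name the French capital city.",
]
print(synthesized_answer(variants))  # Paris
```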
RESEARCHERS ARE still struggling to understand where LLM mistakes diverge from human ones. Some of the weirdness of AI is actually more humanlike than it first appears.
Small changes to a query to an LLM can result in wildly different responses, a problem known as prompt sensitivity. But, as any survey researcher can tell you, humans behave this way, too. The phrasing of a question in an opinion poll can have drastic impacts on the answers.
LLMs also seem to have a bias toward repeating the words that were most common in their training data—for example, guessing familiar place names like “America” even when asked about more exotic locations. Perhaps this is an example of the human “availability heuristic” manifesting in LLMs; like humans, the machines spit out the first thing that comes to mind rather than reasoning through the question. Also like humans, perhaps, some LLMs seem to get distracted in the middle of long documents; they remember more facts from the beginning and end.
In some cases, what’s bizarre about LLMs is that they act more like humans than we think they should. Some researchers have tested the hypothesis that LLMs perform better when offered a cash reward or threatened with death. It also turns out that some of the best ways to “jailbreak” LLMs (getting them to disobey their creators’ explicit instructions) look a lot like the kinds of social-engineering tricks that humans use on each other: for example, pretending to be someone else or saying that the request is just a joke. But other effective jailbreaking techniques are things no human would ever fall for. One group found that if they used ASCII art (constructions of symbols that look like words or pictures) to pose dangerous questions, like how to build a bomb, the LLM would answer them willingly.
Humans may occasionally make seemingly random, incomprehensible, and inconsistent mistakes, but such occurrences are rare and often indicative of more serious problems. We also tend not to put people exhibiting these behaviors in decision-making positions. Likewise, we should confine AI decision-making systems to applications that suit their actual abilities—while keeping the potential ramifications of their mistakes firmly in mind.
Asimov’s Laws of Robotics Need an Update for AI PROPOSING A FOURTH LAW OF ROBOTICS
Dariusz Jemielniak
IN 1942, the legendary science fiction author Isaac Asimov introduced his Three Laws of Robotics in his short story “Runaround.” The laws were later popularized in his seminal story collection I, Robot.
FIRST LAW: A robot may not injure a human being or, through inaction, allow a human being to come to harm.
SECOND LAW: A robot must obey the orders given it by human beings except where such orders would conflict with the First Law.
THIRD LAW: A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.
While drawn from works of fiction, these laws have shaped discussions of robot ethics for decades. And as AI systems—which can be considered virtual robots—have become more sophisticated and pervasive, some technologists have found Asimov’s framework useful for considering the potential safeguards needed for AI that interacts with humans.
But the existing three laws are not enough. Today, we are entering an era of unprecedented human-AI collaboration that Asimov could hardly have envisioned. The rapid advancement of generative AI, particularly in language and image generation, has created challenges beyond Asimov’s original concerns about physical harm and obedience.
THE PROLIFERATION of AI-enabled deception is particularly concerning. According to the FBI’s most recent Internet Crime Report, cybercrime involving digital manipulation and social engineering results in annual losses counted in the billions. The European Union Agency for Cybersecurity’s ENISA Threat Landscape 2023 highlighted deepfakes—synthetic media that appear genuine—as an emerging threat to digital identity and trust.
Social-media misinformation is a huge problem today. I studied it during the pandemic extensively and can say that the proliferation of generative AI tools has made its detection increasingly difficult. AI-generated propaganda is often just as persuasive as or even more persuasive than traditional propaganda, and bad actors can very easily use AI to create convincing content. Deepfakes are on the rise everywhere. Botnets can use AI-generated text, speech, and video to create false perceptions of widespread support for any political issue. Bots are now capable of making phone calls while impersonating people, and AI scam calls imitating familiar voices are increasingly common. Any day now, we can expect a boom in video-call scams based on AI-rendered overlay avatars, allowing scammers to impersonate loved ones and target the most vulnerable populations.
Even more alarmingly, children and teenagers are forming emotional attachments to AI agents, and are sometimes unable to distinguish between interactions with real friends and bots online. Already, there have been suicides attributed to interactions with AI chatbots.
In his 2019 book Human Compatible (Viking), the eminent computer scientist Stuart Russell argues that AI systems’ ability to deceive humans represents a fundamental challenge to social trust. This concern is reflected in recent policy initiatives, most notably the European Union’s AI Act, which includes provisions requiring transparency in AI interactions and transparent disclosure of AI-generated content. In Asimov’s time, people couldn’t have imagined the countless ways in which artificial agents could use online communication tools and avatars to deceive humans.
Therefore, we must make an addition to Asimov’s laws.
FOURTH LAW: A robot or AI must not deceive a human being by impersonating a human being.
WE NEED CLEAR BOUNDARIES. While human-AI collaboration can be constructive, AI deception undermines trust and leads to wasted time, emotional distress, and misuse of resources. Artificial agents must identify themselves to ensure our interactions with them are transparent and productive. AI-generated content should be clearly marked unless it has been significantly edited and adapted by a human.
Implementation of this Fourth Law would require
mandatory AI disclosure in direct interactions,
clear labeling of AI-generated content,
technical standards for AI identification,
legal frameworks for enforcement, and
educational initiatives to improve AI literacy.
Of course, all this is easier said than done. Enormous research efforts are already underway to find reliable ways to watermark or detect AI-generated text, audio, images, and videos. But creating the transparency I’m calling for is far from a solved problem.
The future of human-AI collaboration depends on maintaining clear distinctions between human and artificial agents. As noted in the IEEE report Ethically Aligned Design, transparency in AI systems is fundamental to building public trust and ensuring the responsible development of artificial intelligence.
Asimov’s complex stories showed that even robots that tried to follow the rules often discovered there were unintended consequences to their actions. Still, having AI systems that are at least trying to follow Asimov’s ethical guidelines would be a very good start.
What Can AI Researchers Learn from Alien Hunters?
THE SETI INSTITUTE’S APPROACH HAS LESSONS FOR RESEARCH ON ARTIFICIAL GENERAL INTELLIGENCE
Edmon Begoli & Amir Sadovnik
THE EMERGENCE OF artificial general intelligence (systems that can perform any intellectual task a human can) could be the most important event in human history. Yet AGI remains an elusive and controversial concept. We lack a clear definition of what it is, we don’t know how to detect it, and we don’t know how to interact with it if it finally emerges.
What we do know is that today’s approaches to studying AGI are not nearly rigorous enough. Companies like OpenAI are actively striving to create AGI, but they include research on AGI’s social dimensions and safety issues only as their corporate leaders see fit. And academic institutions don’t have the resources for significant efforts.
We need a structured scientific approach to prepare for AGI. A useful model comes from an unexpected field: the search for extraterrestrial intelligence, or SETI. We believe that the SETI Institute’s work provides a rigorous framework for detecting and interpreting signs of intelligent life.
The idea behind SETI goes back to the beginning of the space age. In their 1959 Nature paper, the physicists Giuseppe Cocconi and Philip Morrison suggested ways to search for interstellar communication. Given the uncertainty of extraterrestrial civilizations’ existence and sophistication, they theorized about how we should best “listen” for messages from alien societies.
We argue for a similar approach to studying AGI, in all its uncertainties. The last few years have shown a vast leap in AI capabilities. The large language models (LLMs) that power chatbots like ChatGPT and enable them to converse convincingly with humans have renewed the discussion of AGI. One notable 2023 preprint even argued that ChatGPT shows “sparks” of AGI, and today’s most cutting-edge language models are capable of sophisticated reasoning and outperform humans in many evaluations.
While these claims are intriguing, there are reasons to be skeptical. In fact, a large group of scientists have argued that the current set of tools won’t bring us any closer to true AGI. But given the risks associated with AGI, if there is even a small likelihood of it occurring, we must make a serious effort to develop a standard definition of AGI, establish a SETI-like approach to detecting it, and devise ways to safely interact with it if it emerges.
THE CRUCIAL FIRST step is to define what exactly to look for. In SETI’s case, researchers decided to look for certain narrowband signals that would be distinct from other radio signals present in the cosmic background. These signals are considered intentional and only produced by intelligent life. None have been found so far.
In the case of AGI, matters are far more complicated. Today, there is no clear definition of artificial general intelligence. The term is hard to define because it contains other imprecise and controversial terms. Although intelligence has been defined by the Oxford English Dictionary as “the ability to acquire and apply knowledge and skills,” there is still much debate on which skills are involved and how they can be measured. The term general is also ambiguous. Does an AGI need to be able to do absolutely everything a human can do?
One of the first missions of a “SETI for AGI” project must be to clearly define the terms general and intelligence so the research community can speak about them concretely and consistently. These definitions need to be grounded in disciplines such as computer science, measurement science, neuroscience, psychology, mathematics, engineering, and philosophy.
There’s also the crucial question of whether a true AGI must include consciousness and self-awareness. These terms also have multiple definitions, and the relationships between them and intelligence must be clarified. Although it’s generally thought that consciousness isn’t necessary for intelligence, it’s often intertwined with discussions of AGI because creating a self-aware machine would have many philosophical, societal, and legal implications.
NEXT COMES the task of measurement. In the case of SETI, if a candidate narrowband signal is detected, an expert group will verify that it is indeed from an extraterrestrial source. They’ll use established criteria—for example, looking at the signal type and checking for repetition—and conduct assessments at multiple facilities for additional validation.
How to best measure computer intelligence has been a long-standing question in the field. In a famous 1950 paper, Alan Turing proposed the “imitation game,” more widely known as the Turing Test, which assesses whether human interlocutors can distinguish if they are chatting with a human or a machine. Although the Turing Test was useful in the past, the rise of LLMs has made clear that it isn’t a complete enough test to measure intelligence. As Turing himself noted, the relationship between imitating language and thinking is still an open question.
Future appraisals must be directed at different dimensions of intelligence. Although measures of human intelligence are controversial, IQ tests can provide an initial baseline to assess one dimension. In addition, cognitive tests on topics such as creative problem-solving, rapid learning and adaptation, reasoning, and goal-directed behavior would be required to assess general intelligence.
But it’s important to remember that these cognitive tests were designed for humans and might contain assumptions that might not apply to computers, even those with AGI abilities. For example, depending on how it’s trained, a machine may score very high on an IQ test but remain unable to solve much simpler tasks. In addition, an AI may have new abilities that aren’t measurable by our traditional tests. There’s a clear need to design novel evaluations that can alert us when meaningful progress is made toward AGI.
IF WE DEVELOP AGI, we must be prepared to answer questions such as: Is the new form of intelligence a new form of life? What kinds of rights does it have? What are the potential safety concerns, and what is our approach to containing the AGI entity?
Here, too, SETI provides inspiration. SETI’s postdetection protocols emphasize validation, transparency, and international cooperation, with the goal of maximizing the credibility of the process, minimizing sensationalism, and bringing structure to such a profound event. Likewise, we need internationally recognized AGI protocols to bring transparency to the entire process, apply safety-related best practices, and begin the discussion of ethical, social, and philosophical concerns.
We readily acknowledge that the SETI analogy can go only so far. If AGI emerges, it will be a human-made phenomenon. We will likely gradually engineer AGI and see it slowly emerge, so detection might be a process that takes place over a period of years, if not decades. In contrast, the existence of extraterrestrial life is something that we have no control over, and contact could happen very suddenly.
The consequences of a true AGI are entirely unpredictable. To best prepare, we need a methodical approach to defining, detecting, and interacting with AGI, which could be the most important development in human history.
In October 2024, I was invited by Dr Lydia C. and Dr Peng C to give two presentations as a guest lecturer at La Trobe University (Melbourne) to the students enrolled in CSE5DMI Data Mining and CSE5ML Machine Learning.
The lectures focused on data mining and machine learning applications and practice in industry and digital retail, and on how students should prepare themselves for their future. Attendees were postgraduate students enrolled in CSE5ML or CSE5DMI in 2024 Semester 2, approximately 150 students for each subject, pursuing one of the following degrees:
This repository is intended for educational purposes only. The content, including presentations and case studies, is provided “as is” without any warranties or guarantees of any kind. The authors and contributors are not responsible for any errors or omissions, or for any outcomes related to the use of this material. Use the information at your own risk. All trademarks, service marks, and company names are the property of their respective owners. The inclusion of any company or product names does not imply endorsement by the authors or contributors.
This is a public repository aiming to share the lecture materials with the public. The *.excalidraw files can be downloaded and opened at https://excalidraw.com/.
A recommendation system is an artificial intelligence or AI algorithm, usually associated with machine learning, that uses Big Data to suggest or recommend additional products to consumers. These can be based on various criteria, including past purchases, search history, demographic information, and other factors.
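As a minimal illustration of the idea (not production code), the sketch below recommends items bought by users with similar purchase histories; the toy data and the Jaccard similarity measure are illustrative choices.

```python
# Toy user-based recommendation: score unseen items by the similarity of the
# users who bought them to the target user.
purchases = {
    "alice": {"laptop", "mouse", "keyboard"},
    "bob":   {"laptop", "mouse", "monitor"},
    "carol": {"phone", "earbuds"},
}

def jaccard(a, b):
    # Similarity of two purchase histories: shared items / all items
    return len(a & b) / len(a | b) if a | b else 0.0

def recommend(user, k=3):
    history = purchases[user]
    scores = {}
    for other, items in purchases.items():
        if other == user:
            continue
        sim = jaccard(history, items)
        for item in items - history:          # only items the user hasn't bought yet
            scores[item] = scores.get(item, 0.0) + sim
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("alice"))  # e.g. ['monitor', ...]
```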
This presentation is developed for students of CSE5ML LaTrobe University, Melbourne and used in the guest lecture on 2024 October 14.
Entity matching – the task of clustering duplicated database records to underlying entities.”Given a large collection of records, cluster these records so that the records in each cluster all refer to the same underlying entity.”
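A minimal sketch of that task, under simplifying assumptions (fuzzy string similarity on a couple of fields, plus union-find to form clusters), might look like this:

```python
# Toy entity matching: pairwise fuzzy comparison of records, then union-find to
# cluster records that refer to the same underlying entity. Threshold and fields
# are illustrative choices.
from difflib import SequenceMatcher

records = [
    {"id": 0, "name": "Jon Smith",  "city": "Melbourne"},
    {"id": 1, "name": "John Smith", "city": "Melbourne"},
    {"id": 2, "name": "Jane Doe",   "city": "Sydney"},
]

parent = list(range(len(records)))

def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def union(i, j):
    parent[find(i)] = find(j)

def similar(a, b, threshold=0.85):
    key_a = f"{a['name']} {a['city']}".lower()
    key_b = f"{b['name']} {b['city']}".lower()
    return SequenceMatcher(None, key_a, key_b).ratio() >= threshold

for i in range(len(records)):
    for j in range(i + 1, len(records)):
        if similar(records[i], records[j]):
            union(i, j)

clusters = {}
for i, r in enumerate(records):
    clusters.setdefault(find(i), []).append(r["id"])
print(clusters)  # records 0 and 1 end up in the same cluster
```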
This presentation is developed for students of CSE5DMI LaTrobe University, Melbourne and used in the guest lecture on 2024 October 15.
Contribution to the Company and Society
This journey also aligns with the Company's strategy.
Being invited to be a guest lecturer for students with related knowledge backgrounds in 2024 aligns closely with EDG's core values of "we're real, we're inclusive, we're responsible".
By participating in a guest lecture and discussion on data analytics and AI/ML practice beyond theories, we demonstrate our commitment to sharing knowledge and expertise, embodying our responsibility to contribute positively to the academic community and bridge the gap between theory builders and problem solvers.
This event allows us to inspire and educate students in the same domains at La Trobe University, showcasing our passion and enthusiasm for the business. Through this engagement, we aim to positively impact attendees, providing suggestions for their career paths, and fostering a spirit of collaboration and continuous learning.
Showing our purpose, values, and ways of working will impress future graduates who may want to come and work for us, and to stay and thrive with us. It also helps us deliver on our purpose to create a more sociable future, together.
Moreover, I am grateful for all the support and encouragement I have received from my university friends and teammates throughout this journey. Additionally, the teaching resources and environment in the West Lecture Theatres at La Trobe University are outstanding!
Credits: this post is a set of notes on the key points from YouTube content creator Programming with Mosh's video, with some editorial work. TL;DR: watch the video.
Is coding still worth learning in 2024?
This is a common question for many people, especially the younger generation of students, when they try to choose a career path with some assurance of future income.
People are worried that AI is going to replace software engineers, or anyone whose work involves coding and design.
As you know, we should trust solid data instead of media hype and hearsay in the digital era. Social media has been fueling the anxiety that every job will collapse because of AI and that coding has no future.
But I’ve got a different take backed up by real-world numbers as follows.
Note: In this post, “software engineer” represents all groups of coders (data engineer, data analyst, data scientist, machine learning engineer, frontend/backend/full-stack developers, programmers and researchers).
Is AI replacing software engineers?
The short answer is NO.
But there is a lot of fear about AI replacing coders. Headlines scream about robots taking over jobs, and it can be overwhelming. But the truth is:
AI is not going to take your job; the people who can work with AI will have the advantage, and they are the ones who will probably take it.
Software engineering is not going away, at least not anytime soon. Here is some data to back this up.
The US Bureau of Labor Statistics (BLS) is a government agency that tracks job growth across the country and publishes the data on its website. From that data, we see continued demand for software developers and for computer and information research scientists.
The BLS projects that employment of software developers will grow by 26% from 2022 to 2032, while the average across all occupations is only 3%. This is a strong indication that software engineering is here to stay.
In our lives, the research and development conducted by computer and information research scientists turns ideas into technology. As demand for new and better technology grows, demand for computer and information research scientists will grow as well.
There is a similar trend for Computer and Information Research Scientists, which is expected to grow by 23% from 2022 to 2032.
To better understand the impact of AI on software engineering, let’s do a quick revisit of the history of programming.
In the early days of programming, engineers wrote code in a form only the computer understood. Then we created compilers, which let us program in human-readable languages like C++ and Java without worrying about how the code eventually gets converted into zeros and ones, or where it is stored in memory.
Here is the fact
Compilers did not replace programmers. They made them more efficient!
Since then we have built so many software applications and totally changed the world.
The problem with AI-generated code
AI will likely do the same in shaping the future: we will be able to delegate routine and repetitive coding tasks to AI so we can focus on complex problem-solving, design, and innovation.
This will allow us to build more sophisticated software applications than most people can even imagine today. But even then, just because AI can generate code doesn't mean we can or should delegate the entire coding aspect of software development to AI, because
AI-generated code is often lower quality; we still need to review and refine it before using it in production.
source: Abstract of the 2023 Data Shows Downward Pressure on Code Quality
So, yes, we can produce more code with AI, but
More Code != Better Code
Humans should always review and refine AI-generated code for quality and security before deploying it to production. That means all the coding skills a software engineer currently has will continue to stay relevant in the future.
You still need knowledge of data structures and algorithms, programming languages and their tricky parts, and tools and frameworks to review and refine AI-generated code; you will just spend less time typing it into the computer.
So anyone telling you that you can use natural language to build software without understanding anything about coding is out of touch with the reality of software engineering (or they are trying to sell you something, e.g., GPUs).
Of course, you can make a dummy app with AI in minutes, but this is not the same kind of software that runs our banks, transportation, healthcare, security, and more. These are the software systems that really matter, and our lives depend on them. We can't let a code monkey talk to a chatbot in English and get that kind of software built. At least, this will not happen in our lifetime.
In the future, we will probably spend more time designing new features and products with AI instead of writing boilerplate code. We will likely delegate aspects of coding to AI, but this doesn’t mean we don’t need to learn to code.
As a software engineer or any coding practitioner, you will always need to review what AI generates and refine it either by hand or by guiding the AI to improve the code.
Keep in mind that coding is only one small part of a software engineer's job; we often spend most of our time talking to people, understanding requirements, writing stories, discussing software and system architecture, and so on.
Instead of being worried about AI, I’m more concerned about Human Intelligence!
Does AI really make you code faster?
AI can boost our programming productivity, but not necessarily our overall productivity.
AI helped the most with documentation and, to some extent, code generation, but for code refactoring the improvement dropped to 20%, and for high-complexity tasks it was less than 10%.
Time savings shrank to less than 10 percent on tasks that developers deemed high in complexity due to, for example, their lack of familiarity with a necessary programming framework.
Thus, if anyone tells you that software engineers will be obsolete in 5 years, they are either ignorant or trying to sell you something.
In fact, some studies suggest that the role of software engineers (coders) may become more valuable, as they will be needed to develop, manage, and maintain these AI systems.
They (software engineers) need to understand all the complexity of building software and use AI to boost their productivity.
Can one AI-powered engineer do the work of many?
Now, people are worried that one senior engineer can simply use AI to do the work of many engineers, eventually leaving no job opportunities for juniors.
But again, this is a fallacy, because in reality the time savings you get from AI are not as great as promised. Anyone who uses AI to generate code knows that: it takes effort to get the right prompts for usable results, and the code still needs polishing.
Thus, it is not as if one engineer will suddenly have so much free time that they can do the job of many people.
But you may ask: that is now, what about the future? Maybe in a year or two, AI will start to build software like a human.
In theory, yes, AI is advancing, and one day it may even reach and surpass human intelligence. But, as the saying often attributed to Einstein goes:
In Theory, Theory and Practice are the Same.
In Practice, they are NOT.
The reality is that while machines may be able to handle repetitive and routine tasks, human creativity and expertise will still be necessary for developing complex solutions and strategies.
Software engineering will be extremely important over the next several decades. I don’t think it is going away in the future, but I do believe it will change.
Future of Software Engineering
Software powers our world and that will not change anytime soon.
In the future, we will have to learn how to give the right prompts to our AI tools to get the expected results. This is not an easy skill to develop; it requires problem-solving capability as well as knowledge of programming languages and tools. So if you've already made up your mind and don't want to invest your time in software engineering or coding, that's perfectly fine. Follow your passion!
The coding tools will evolve, as they always do, but the real skill lies in learning and adapting. The future engineer needs today's coding skills and a good understanding of how to use AI effectively. The future brings more complexity and demands more knowledge and adaptability from software engineers.
If you like building things with code, and if the idea of shaping the future with technology gets you excited, don’t let negativity and fear of Gen-AIs hold you back.
German Navy Ciphertext by Enigma M3: OJSBI BUPKA ECMEE ZH
German Message: Zielhafen von DOVER
English Translation: Target port of DOVER
Enigma Mission – X
By running the notebook, it is not difficult to complete the deciphering process with the “keys” to get the original message from the ciphertext by the German Navy.
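For illustration, a deciphering step like the one in the notebook can be sketched with the py-enigma package; the rotor order, ring settings, plugboard pairs, and start position below are placeholder values, not the actual daily key from the challenge.

```python
# Sketch of the deciphering step using py-enigma ("pip install py-enigma").
# All machine settings below are hypothetical placeholders for the daily key.
from enigma.machine import EnigmaMachine

machine = EnigmaMachine.from_key_sheet(
    rotors="II IV V",                 # hypothetical wheel order
    reflector="B",
    ring_settings=[1, 20, 11],        # hypothetical ring settings
    plugboard_settings="AV BS CG DL FU HZ IN KM OW RX",  # hypothetical plugboard
)

machine.set_display("WXC")            # hypothetical message key / start position
ciphertext = "OJSBIBUPKAECMEEZH"
plaintext = machine.process_text(ciphertext)
print(plaintext)                      # with the correct keys: ZIELHAFENVONDOVER
```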
Notebook Outputs Example
However, it would be difficult to break the cipher without knowing the keys. That is the Turing-Welchman Bombe Simulator challenge.
About Enigma Mission X
Mission X is a game for programmers to accomplish the deciphering job required by Dr Alan Turing.
Mission X Letter from Alan Turing
Programmers need to break the secret with limited information as follows.
Prompt engineering is like adjusting audio without opening the equipment.
Introduction
Prompt Engineering, also known as In-Context Prompting, refers to methods for communicating with a Large Language Model (LLM) like GPT (Generative Pre-trained Transformer) to manipulate/steer its behaviour for expected outcomes without updating, retraining or fine-tuning the model weights.
Researchers, developers, or users may engage in prompt engineering to instruct a model for specific tasks, improve the model’s performance, or adapt it to better understand and respond to particular inputs. It is an empirical science and the effect of prompt engineering methods can vary a lot among models, thus requiring heavy experimentation and heuristics.
This post focuses only on prompt engineering for autoregressive language models, so it does not cover image generation or multimodal models.
Basic Prompting
Zero-shot and few-shot prompting are the two most basic ways to prompt a model, popularized by many LLM papers and commonly used for benchmarking. In other words, zero-shot and few-shot testing are scenarios that evaluate how well a large language model (LLM) handles a task with little or no task-specific training data. Here are examples of both:
Zero-shot
Zero-shot learning simply feeds the task text to the model and asks for results.
Scenario: Text Completion (Please try the following input in ChatGPT or Google Bard)
Input:
Task: Complete the following sentence:
Input: The capital of France is ____________.
Output (ChatGPT / Bard):
Output: The capital of France is Paris.
Few-shot
Few-shot learning presents a set of high-quality demonstrations, each consisting of both input and desired output, on the target task. As the model first sees good examples, it can better understand human intention and criteria for what kinds of answers are wanted. Therefore, few-shot learning often leads to better performance than zero-shot. However, it comes at the cost of more token consumption and may hit the context length limit when the input and output text are long.
Scenario: Text Classification
Input:
Task: Classify movie reviews as positive or negative.
Examples:
Review 1: This movie was amazing! The acting was superb. Sentiment: Positive
Review 2: I couldn't stand this film. The plot was confusing. Sentiment: Negative
Question: Review: I'll bet the video game is a lot more fun than the film. Sentiment:____
Output
Sentiment: Negative
Many studies have explored the construction of in-context examples to maximize performance. They observed that the choice of prompt format, training examples, and the order of the examples can significantly impact performance, ranging from near-random guesses to near-state-of-the-art performance.
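As a concrete illustration of how such in-context examples are assembled (and how easily their order can be varied for the kind of sensitivity study mentioned above), here is a small sketch; the demonstrations are just the toy reviews from the example, not from any benchmark.

```python
import random

# Toy demonstrations: (review, sentiment) pairs from the example above.
demos = [
    ("This movie was amazing! The acting was superb.", "Positive"),
    ("I couldn't stand this film. The plot was confusing.", "Negative"),
]

def build_few_shot_prompt(demos, query, shuffle=False, seed=None):
    """Assemble a few-shot classification prompt from demonstrations."""
    if shuffle:
        demos = demos[:]
        random.Random(seed).shuffle(demos)  # vary example order
    lines = ["Task: Classify movie reviews as positive or negative.", ""]
    for review, sentiment in demos:
        lines.append(f"Review: {review}")
        lines.append(f"Sentiment: {sentiment}")
        lines.append("")
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    demos, "I'll bet the video game is a lot more fun than the film.")
print(prompt)  # paste into ChatGPT/Bard, or send through any LLM API
```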
Hallucination
In the context of Large Language Models (LLMs), hallucination refers to a situation where the model generates outputs that are incorrect or not grounded in reality. A hallucination occurs when the model produces information that seems plausible or coherent but is actually not accurate or supported by the input data.
For example, in a language generation task, if a model is asked to provide information about a topic and it generates details that are not factually correct or have no basis in the training data, it can be considered as hallucination. This phenomenon is a concern in natural language processing because it can lead to the generation of misleading or false information.
Addressing hallucination in LLMs is a challenging task, and researchers are actively working on developing methods to improve the models’ accuracy and reliability. Techniques such as fine-tuning, prompt engineering, and designing more specific evaluation metrics are among the approaches used to mitigate hallucination in language models.
Perfect Prompt Formula for ChatBots
For everyday personal work such as text generation and documentation, six key components make up the perfect prompt formula for ChatGPT and Google Bard:
Task, Context, Exemplars, Persona, Format, and Tone.
Prompt Formula for ChatBots
The Task sentence needs to articulate the end goal and start with an action verb.
Use three guiding questions to help structure relevant and sufficient Context.
Exemplars can drastically improve the quality of the output by giving specific examples for the AI to reference.
For Persona, think of who you would ideally want the AI to be in the given task situation.
Visualizing your desired end result will let you know what format to use in your prompt.
And you can actually use ChatGPT to generate a list of Tone keywords for you to use!
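To make the formula concrete, here is a small sketch that stitches the six components into one prompt; the component texts in the usage example are invented placeholders you would swap for your own.

```python
def build_prompt(task, context="", exemplars="", persona="", fmt="", tone=""):
    """Combine the six components into a single chatbot prompt.
    Only 'task' is mandatory; empty components are simply skipped."""
    parts = [
        persona and f"You are {persona}.",
        task,                      # should start with an action verb
        context and f"Context: {context}",
        exemplars and f"Here is an example to follow:\n{exemplars}",
        fmt and f"Format the output as {fmt}.",
        tone and f"Use a {tone} tone.",
    ]
    return "\n".join(p for p in parts if p)

# Hypothetical usage:
print(build_prompt(
    task="Write a follow-up email to my project team.",
    context="We missed last week's milestone; the demo is in three days.",
    persona="an experienced engineering manager",
    fmt="a short email with a subject line and three bullet points",
    tone="encouraging but direct",
))
```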
If you have ever wondered what the heck those techies are talking about when they use the words above, please continue…
OK, so here’s the deal. We’re diving into the world of academia, talking about machine learning and large language models in the computer science and engineering domains. I’ll try to explain it in a simple way, but you can always dig deeper into these topics elsewhere.
RAG: Retrieval-Augmented Generation
RAG (Retrieval-Augmented Generation) refers to a model that combines retrieval and generation: it uses a retrieval mechanism to fetch relevant information from a database or knowledge base and then generates a response grounded in that retrieved information. In real applications, the user's input and the model's output are also pre- and post-processed to follow certain rules and comply with laws and regulations.
RAG: Retrieval-Augmented Generation
Here is a simplified example of using a Retrieval-Augmented Generation (RAG) model for a question-answering task. In this example, we’ll use a system that retrieves relevant passages from a knowledge base and generates an answer based on that retrieved information.
Input:
User Query: What are the symptoms of COVID-19?
Knowledge Base:
1. Title: Symptoms of COVID-19 Content: COVID-19 symptoms include fever, cough, shortness of breath, fatigue, body aches, loss of taste or smell, sore throat, etc.
2. Title: Prevention measures for COVID-19 Content: To prevent the spread of COVID-19, it's important to wash hands regularly, wear masks, practice social distancing, and get vaccinated.
3. Title: COVID-19 Treatment Content: COVID-19 treatment involves rest, hydration, and in severe cases, hospitalization may be required.
RAG Model Output:
Generated Answer:
The symptoms of COVID-19 include fever, cough, shortness of breath, fatigue, body aches, etc.
Remark: ChatGPT 3.5 gives basic results like the above, while Google Bard adds extra resources such as CDC links and other sources pulled from its search engine. We can guess that Google uses a different framework from OpenAI's.
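Here is a deliberately simplified sketch of the retrieve-then-generate flow above: retrieval is just keyword overlap over the three knowledge-base entries, and the "generation" step only assembles the prompt you would hand to an LLM. A real system would use embeddings, a vector store, and an actual model call.

```python
# Minimal RAG flow: keyword-overlap retrieval + prompt assembly.
knowledge_base = [
    {"title": "Symptoms of COVID-19",
     "content": "COVID-19 symptoms include fever, cough, shortness of breath, "
                "fatigue, body aches, loss of taste or smell, sore throat."},
    {"title": "Prevention measures for COVID-19",
     "content": "Wash hands regularly, wear masks, practice social distancing, "
                "and get vaccinated."},
    {"title": "COVID-19 Treatment",
     "content": "Rest, hydration, and in severe cases hospitalization."},
]

def retrieve(query, docs, top_k=1):
    """Rank documents by how many query words they share (toy retriever)."""
    q_words = set(query.lower().split())
    scored = [(len(q_words & set((d["title"] + " " + d["content"]).lower().split())), d)
              for d in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:top_k]]

def build_rag_prompt(query, docs):
    context = "\n".join(f"- {d['title']}: {d['content']}" for d in docs)
    return (f"Answer the question using only the context below.\n"
            f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")

query = "What are the symptoms of COVID-19?"
print(build_rag_prompt(query, retrieve(query, knowledge_base)))
```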
CoT: Chain-of-Thought
Chain-of-thought (CoT) prompting (Wei et al. 2022) generates a sequence of short sentences to describe reasoning logics step by step, known as reasoning chains or rationales, to eventually lead to the final answer.
The benefit of CoT is more pronounced for complicated reasoning tasks while using large models (e.g. with more than 50B parameters). Simple tasks only benefit slightly from CoT prompting.
Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, essentially creating a tree structure. The search process can be BFS or DFS while each state is evaluated by a classifier (via a prompt) or majority vote.
CoT : Chain-of-Thought and ToT: Tree-of-Thought
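The simplest way to try CoT yourself is to add worked reasoning to a few-shot demonstration (or just append a "think step by step" instruction); the arithmetic word problems below are invented illustrations, not examples from the paper.

```python
# Few-shot Chain-of-Thought prompt: the demonstration includes its reasoning,
# so the model is nudged to produce a reasoning chain before the final answer.
cot_prompt = """Q: A library has 120 books. It lends out 45 and receives 20 new ones.
How many books does it have now?
A: Let's think step by step. It starts with 120 books. Lending out 45 leaves
120 - 45 = 75. Receiving 20 new books gives 75 + 20 = 95. The answer is 95.

Q: A bakery makes 60 rolls. It sells 36 in the morning and 15 in the afternoon.
How many rolls are left?
A: Let's think step by step."""

print(cot_prompt)  # paste into any chat LLM and compare against a no-CoT prompt
```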
Self-Ask + Search Engine
Self-Ask (Press et al. 2022) is a method to repeatedly prompt the model to ask follow-up questions to construct the thought process iteratively. Follow-up questions can be answered by search engine results.
Self-Ask+Search Engine Example
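A Self-Ask loop can be sketched as follows. Here llm() and web_search() are hypothetical stubs standing in for a real model call and a real search API, and the "Follow up:" / "Intermediate answer:" / "So the final answer is:" markers follow the general pattern described in the paper.

```python
def llm(prompt: str) -> str:
    """Hypothetical stub for a call to a large language model."""
    raise NotImplementedError("plug in your LLM API call here")

def web_search(question: str) -> str:
    """Hypothetical stub for a search-engine lookup."""
    raise NotImplementedError("plug in your search API call here")

def self_ask(question: str, max_steps: int = 5) -> str:
    # The prompt grows as follow-up questions are asked and answered.
    prompt = f"Question: {question}\nAre follow up questions needed here: Yes.\n"
    for _ in range(max_steps):
        continuation = llm(prompt)
        if "So the final answer is:" in continuation:
            return continuation.split("So the final answer is:")[-1].strip()
        if "Follow up:" in continuation:
            follow_up = continuation.split("Follow up:")[-1].strip().splitlines()[0]
            answer = web_search(follow_up)           # answer the follow-up via search
            prompt += (f"Follow up: {follow_up}\n"
                       f"Intermediate answer: {answer}\n")
        else:
            prompt += continuation                   # let the model keep reasoning
    return "No final answer within the step budget."
```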
ReAct: Reasoning and Acting
ReAct (Reason + Act; Yao et al. 2023) combines iterative CoT prompting with queries to Wikipedia APIs to search for relevant entities and content and then add it back into the context.
Each trajectory consists of multiple thought-action-observation steps (i.e., dense thought), where free-form thoughts are used for various purposes.
Specifically, in the paper, the authors use a combination of thoughts that decompose questions (“I need to search x, find y, then find z”), extract information from Wikipedia observations (“x was started in 1844”, “The paragraph does not tell x”), perform commonsense reasoning (“x is not y, so z must instead be…”) or arithmetic reasoning (“1844 < 1989”), guide search reformulation (“maybe I can search/lookup x instead”), and synthesize the final answer (“…so the answer is x”).
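The thought-action-observation loop can be sketched like this; llm() and wikipedia_search() are hypothetical stubs (the paper uses a Wikipedia API), and the Search/Finish action names follow the general scheme described above.

```python
import re

def llm(prompt: str) -> str:
    """Hypothetical stub for a call to a large language model."""
    raise NotImplementedError

def wikipedia_search(entity: str) -> str:
    """Hypothetical stub returning a short passage about an entity."""
    raise NotImplementedError

def react(question: str, max_steps: int = 7) -> str:
    prompt = f"Question: {question}\n"
    for step in range(1, max_steps + 1):
        # Ask the model for the next thought and action.
        continuation = llm(prompt + f"Thought {step}:")
        prompt += f"Thought {step}:{continuation}\n"
        match = re.search(r"Action \d+: (\w+)\[(.*?)\]", continuation)
        if not match:
            continue
        action, arg = match.group(1), match.group(2)
        if action == "Finish":
            return arg                               # final answer
        if action == "Search":
            observation = wikipedia_search(arg)      # fetch entity content
            prompt += f"Observation {step}: {observation}\n"
    return "No answer within the step budget."
```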
DSP: Directional Stimulus Prompting
Directional Stimulus Prompting (DSP, Z. Li 2023) is a novel framework for guiding black-box large language models (LLMs) toward specific desired outputs. Instead of directly adjusting the LLM, this method employs a small tunable policy model to generate an auxiliary directional stimulus (hint) prompt for each input instance.
DSP: Directional Stimulus Prompting
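A rough sketch of the DSP idea: a small, tunable policy model produces a hint (e.g., keywords) that is appended to the prompt for the frozen black-box LLM. policy_model() and llm() below are hypothetical stubs, not the authors' implementation.

```python
def policy_model(article: str) -> str:
    """Hypothetical stub for the small tunable policy model.
    In DSP it is trained (e.g., with RL) to emit hint keywords."""
    raise NotImplementedError

def llm(prompt: str) -> str:
    """Hypothetical stub for the frozen black-box LLM."""
    raise NotImplementedError

def summarize_with_dsp(article: str) -> str:
    hint = policy_model(article)   # directional stimulus, e.g. "merger; Q3 revenue"
    prompt = (f"Article: {article}\n"
              f"Hint (keywords the summary should cover): {hint}\n"
              f"Summarize the article in two sentences:")
    return llm(prompt)
```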
Summary and Conclusion
Prompt engineering involves carefully crafting these prompts to achieve desired results. It can include experimenting with different phrasings, structures, and strategies to elicit the desired information or responses from the model. This process is crucial because the performance of language models can be sensitive to how prompts are formulated.
I believe many researchers will agree with me: some prompt engineering papers don't need to be eight pages long. They could explain the important points in a few lines and use the rest for benchmarking.
As researchers and developers delve further into the realms of prompt engineering, they continue to push the boundaries of what these sophisticated models can achieve.
To achieve this, it’s important to create a user-friendly LLM benchmarking system that many people will use. Developing better methods for creating prompts will help advance language models and improve how we use LLMs. These efforts will have a big impact on natural language processing and related fields.
This blog discusses the scale of Large Language Models (LLMs) and their impact on performance. LLMs like GPT, LaMDA, and PaLM have billions of parameters, raising questions about the consequences of their continued growth.
The journey of an LLM involves two stages: pre-training and scenario application. Pre-training focuses on optimizing the model using cross-entropy, while scenario application evaluates the model’s performance in specific use cases. Evaluating an LLM’s quality requires considering both stages, rather than relying solely on pre-training indicators.
Increasing training data, model parameters, and training time has been found to enhance performance in the pre-training stage. OpenAI and DeepMind have explored this issue, with OpenAI finding that a combination of more data and parameters, along with fewer training steps, produces the best results. DeepMind considers the amount of training data and model parameters equally important.
The influence of model size on downstream tasks varies. Linear tasks show consistent improvement as the model scales, while breakthrough tasks only benefit from larger models once they reach a critical scale. Tasks involving logical reasoning demonstrate sudden improvement at specific model scales. Some tasks exhibit U-shaped growth, where performance initially declines but then improves with larger models.
Reducing the LLM’s parameters while increasing training data proportionally can decrease the model’s size without sacrificing performance, leading to faster inference speed.
Understanding the impact of model size on both pre-training and downstream tasks is vital for optimizing LLM performance and exploring the potential of these language models.
Introduction
In recent years, we’ve witnessed a surge in the size of Large Language Models (LLMs), with models now boasting over 100 billion parameters becoming the new standard. Think OpenAI’s GPT-3 (175B), Google’s LaMDA (137B), PaLM (540B), and other global heavyweights. China, too, contributes to this landscape with models like Zhiyuan GLM, Huawei’s “Pangu,” Baidu’s “Wenxin,” etc. But here’s the big question: What unfolds as these LLMs continue to grow?
The journey of pre-trained models involves two crucial stages: pre-training and scenario application.
In the pre-training stage, the optimization objective is cross-entropy loss; for autoregressive language models such as GPT, this measures whether the LLM correctly predicts the next word.
However, the real test comes in the scenario application stage, where specific use cases dictate evaluation criteria. Generally, our intuition is that if the LLM has better indicators in the pre-training stage, its ability to solve downstream tasks will naturally be stronger. However, this is not entirely true.
Existing research shows that the optimization metric in the pre-training stage is positively correlated with downstream task performance, but the correlation is not perfect. In other words, it is not enough to look only at pre-training metrics to judge whether an LLM is good enough. With that in mind, we will look at these two stages separately to see what happens as the LLM scales up.
Part One: pre-training phase
First, let’s look at what happens as the model size gradually increases during the pre-training stage. OpenAI specifically studied this issue in “Scaling Laws for Neural Language Models” and proposed the “scaling law” followed by the LLM model.
As shown in the figure above, this study shows that when we independently increase (1) the amount of training data, (2) the model parameter count, and (3) the training time (e.g., from 1 epoch to 2 epochs), the loss of the pre-trained model on the test set decreases monotonically. In other words, the model's effectiveness improves steadily.
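This monotone improvement is usually described as a power law in, say, the parameter count: loss falls smoothly as the model grows. The sketch below uses an illustrative power-law shape with made-up constants, not the fitted values from the OpenAI paper.

```python
def power_law_loss(n_params: float, n_c: float = 1e13, alpha: float = 0.08) -> float:
    """Illustrative scaling-law shape: L(N) = (N_c / N)**alpha.
    n_c and alpha here are placeholders, not the paper's fitted constants."""
    return (n_c / n_params) ** alpha

for n in [1e8, 1e9, 1e10, 1e11]:
    print(f"{n:.0e} params -> loss {power_law_loss(n):.3f}")
# The loss decreases monotonically as the parameter count grows.
```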
Since all three factors are important when we actually do pre-training, we have a decision-making problem on how to allocate computing power:
Question: Given a fixed total computing budget for training an LLM (e.g., a fixed number of GPU hours or GPU days), how should we allocate that compute?
Should we increase the amount of data and reduce model parameters?
Or should we increase the amount of data and model size at the same time but reduce the number of training steps?
OpenAI
It is a zero-sum game: if the scale of one factor increases, the scale of the other factors must be reduced to keep the total computing power unchanged, so there are many possible allocation plans.
In the end, OpenAI chose to increase the amount of training data and the number of model parameters at the same time, while using an early-stopping strategy to reduce the number of training steps. Its study shows that increasing only one of the two, data volume or parameter count, is not the best choice; it is better to increase both in a certain proportion. OpenAI's conclusion is to prioritize increasing the model parameters, and then the amount of training data.
Concretely, if the total computing budget increases by 10 times, then the number of model parameters should be increased by about 5.5 times and the amount of training data by about 1.8 times; this yields the best model performance.
DeepMind later studied the same question more systematically in the Chinchilla work ("Training Compute-Optimal Large Language Models"). Its basic conclusions are similar to OpenAI's: it is indeed necessary to increase the amount of training data and the number of model parameters at the same time for the model to improve.
Many large models did not take this into account during pre-training: they were trained by monotonically increasing the model parameters while keeping the amount of training data fixed. This approach is wrong and limits the potential of the LLM.
However, DeepMind revised the proportional relationship proposed by OpenAI and concluded that the amount of training data and the number of model parameters are equally important.
In other words, if the total computing budget increases by 10 times, the number of model parameters should be increased by about 3.3 times and the amount of training data should also be increased by about 3.3 times to obtain the best model.
This means that increasing the amount of training data is more important than we previously thought. Based on this understanding, DeepMind chose a different compute allocation when designing the Chinchilla model: compared with the Gopher model (280B parameters trained on 300B tokens), Chinchilla uses about 4 times more training data but only one-quarter of Gopher's parameters, about 70B. Yet on both pre-training indicators and many downstream task indicators, Chinchilla outperforms the larger Gopher.
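To see how the two recipes differ, here is a quick back-of-the-envelope check. The exponents are simply inferred from the ratios quoted above (5.5x/1.8x for OpenAI, 3.3x/3.3x for DeepMind) and are used only to illustrate the allocation logic, not as values from either paper.

```python
import math

def allocation_exponents(param_ratio: float, data_ratio: float, compute_ratio: float = 10.0):
    """Infer power-law allocation exponents a, b with N ~ C**a and D ~ C**b
    from the quoted scale-up ratios at a given compute increase."""
    a = math.log(param_ratio) / math.log(compute_ratio)
    b = math.log(data_ratio) / math.log(compute_ratio)
    return a, b

for name, (p, d) in {"OpenAI-style": (5.5, 1.8), "Chinchilla-style": (3.3, 3.3)}.items():
    a, b = allocation_exponents(p, d)
    print(f"{name}: params ~ C^{a:.2f}, data ~ C^{b:.2f}, "
          f"at 100x compute -> {100**a:.0f}x params, {100**b:.0f}x data")
```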
This brings us to the following insight:
We can enlarge the training data and shrink the LLM's parameters proportionally, greatly reducing the size of the model without hurting its performance.
Reducing the size of the model has many benefits; for example, inference will be much faster at deployment time. This is undoubtedly a promising development route for LLMs.
Part Two: downstream tasks
The above covers the impact of model scale in the pre-training stage. From the perspective of solving specific downstream tasks, different types of tasks behave differently as the model scale increases.
Specifically, there are the following three types of tasks.
(a) Tasks that achieve the highest linearity scores see model performance improve predictably with scale and typically rely on knowledge and simple textual manipulations.
(b) Tasks with high breakthroughs do not see model performance improve until the model reaches a critical scale. These tasks generally require sequential steps or logical reasoning. Around 5% of BIG-bench tasks see models achieve sudden score breakthroughs with increasing scale.
(c) Tasks that achieve the lowest (negative) linearity scores see model performance degrade with scale.
Linearity Tasks
The first type of task perfectly reflects the scaling law of the LLM model, which means that as the model scale gradually increases, the performance of the tasks gets better and better, as shown in (a) above.
Such tasks usually have the following common characteristics: they are often knowledge-intensive tasks. That is to say, if the LLM model contains more knowledge, the performance of such tasks will be better.
Many studies have proven that the larger the LLM model, the higher the learning efficiency. For the same amount of training data, the larger the model, the better the performance. This shows that even when faced with the same batch of training data, a larger LLM model is relatively more efficient in getting more knowledge than small ones.
Moreover, in practice, when the number of LLM parameters is increased, the amount of training data is usually increased at the same time, so larger models can learn more knowledge from more data. These findings explain the figure above: as the model size increases, knowledge-intensive tasks keep improving.
Most traditional NLP tasks are actually knowledge-intensive tasks, and many tasks have achieved great improvement in the past few years, even surpassing human performance. Obviously, this is most likely caused by the increase in the scale of the LLM model, rather than due to a specific technical improvement.
Breakthrough Tasks
The second type of task demonstrates that LLMs have some kind of "emergent ability", as shown in (b) above. "Emergent ability" means that when the model's parameter scale is below a certain threshold, the model has essentially no ability to solve such tasks; its performance is equivalent to randomly selecting answers. However, once the model scale crosses that threshold, the LLM's performance on such tasks jumps suddenly.
In other words, model size is the key to unlocking new capabilities of LLMs. As the model size grows larger and larger, more and more new capabilities will gradually be unlocked.
This is a remarkable phenomenon because it suggests a possibility that makes people optimistic about the future: many tasks that LLMs cannot solve well today may be solved in the future simply by making the model larger, since emergent capabilities may suddenly unlock those limits one day. The growth of LLMs may keep bringing us unexpected and wonderful gifts.
The article “Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models” points out that tasks exhibiting “emergent capabilities” also share some common features: they generally consist of multiple steps, solving them requires first solving several intermediate steps, and logical reasoning plays an important role in the final solution.
Chain of Thought (CoT) Prompting is a typical technology that enhances the reasoning ability of LLM, which can greatly improve the effect of such tasks. I will discuss the CoT technology in the following blogs.
Here the most important question is, why does LLM have this “emergent ability” phenomenon? The article “Emergent Abilities of Large Language Models” shares several possible explanations:
One possible explanation is that the evaluation metrics of some tasks are not smooth enough. For example, some metrics for generation tasks require the string output by the model to match the reference answer exactly to be considered correct; otherwise it scores zero.
Thus, even as the model gradually improves and outputs more correct fragments, any small error still yields zero points because the output is not entirely correct. Only when the model is large enough to get every fragment right does the score register at all. In other words, because the metric is not smooth, it hides the fact that the LLM is actually improving gradually on the task, and this shows up externally as an apparent "emergent ability".
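A tiny illustration of the "non-smooth metric" argument: score progressively better outputs with strict exact match versus a softer character-overlap score. The candidate answers below are invented; the strict metric stays at zero until the output is perfect, while the soft score rises gradually.

```python
import difflib

reference = "the capital of france is paris"
candidates = [
    "i do not know",
    "the capital of france is lyon",
    "capital of france is paris",
    "the capital of france is paris",
]

def exact_match(pred: str, ref: str) -> int:
    return int(pred.strip() == ref.strip())                   # all-or-nothing

def soft_overlap(pred: str, ref: str) -> float:
    return difflib.SequenceMatcher(None, pred, ref).ratio()   # partial credit

for pred in candidates:
    print(f"exact={exact_match(pred, reference)}  "
          f"soft={soft_overlap(pred, reference):.2f}  |  {pred}")
# Exact match jumps from 0 to 1 only on the last output; the soft score rises gradually.
```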
Another possible explanation is that some tasks are composed of several intermediate steps. As the size of the model increases, the ability to solve each step gradually increases, but as long as one intermediate step is wrong, the final answer will be wrong. This will also lead to this superficial “emergent ability” phenomenon.
Of course, the above explanations are still conjectures at present. As for why LLM has this phenomenon, further and in-depth research is needed.
U-shaped Tasks
There is also a small number of tasks whose performance curve is U-shaped as the model grows: performance first gets worse as the model size increases, but once the model grows further, performance starts to improve again. The figure above shows this U-shaped trend in the indicators of the pink PaLM model on two such tasks.
Why do these tasks appear so special? The article “Inverse Scaling Can Become U-shaped” gives an explanation:
These tasks actually contain two different types of subtasks: the real task and a "distractor task".
When the model size is small, it cannot identify any sub-task, so the performance of the model is similar to randomly selecting answers.
When the model grows to a medium size, it mainly tries to solve the interference task, so it has a negative impact on the real task performance. This is reflected in the decline of the real task effect.
When the model size is further increased, LLM can ignore the interfering task and perform the real task, which is reflected in the effect starting to grow.
For tasks whose performance declines as the model grows, using Chain-of-Thought (CoT) prompting converts some of them to follow the scaling law again (the larger the model, the better the performance), while others convert to a U-shaped growth curve.
This actually shows that this type of task should be a reasoning-type task, so the task performance will change qualitatively after adding CoT.
Personal View
Increasing the size of the LLM may not seem technically significant, but it is actually very important for building better LLMs. In my opinion, the advances from BERT to GPT-3 and ChatGPT are likely attributable to the growth in model size rather than to any single technique. I believe many people want to explore the scale ceiling of LLMs if they can.
The key to achieving AGI may lie in having large and diverse data, large-scale models, and rigorous training processes. Developing such large LLMs requires strong engineering skills from the technical team, so there is real technical substance involved.
Increasing the scale of the LLM model has research significance. There are two main reasons why it is valuable.
Firstly, as the model size grows, the performance of various tasks improves, especially for knowledge-intensive tasks. Additionally, for reasoning and difficult tasks, the effect of adding CoT Prompting follows a scaling law. Therefore, it is important to determine to what extent the scale effect of LLM can solve these tasks.
Secondly, the “emergent ability” of LLM suggests that increasing the model size may unlock new capabilities that we did not expect. This raises the question of what these capabilities could be.
Considering these factors, it is necessary to continue increasing the model size to explore the limits of its ability to solve different tasks.
Talk is cheap; in reality, very few AI/ML practitioners have the opportunity or the means to build larger models, given the enormous financial cost and the investment willingness, engineering capability, and technical enthusiasm it demands from research institutions. There are probably no more than ten institutions on Earth that can do this. In the future, however, capable institutions may join forces to build a super-large model:
All (Resources) for One (Model) and One (Model) for All (People).
Modified from Alexandre Dumas, The Three Musketeers
BIG-bench Project Team (2023). Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models. https://arxiv.org/abs/2206.04615