Our Future with AI: Three Strategies to Ensure It Stays on Our Side

As artificial intelligence rapidly evolves, ensuring it remains a beneficial tool rather than a source of unforeseen challenges is paramount; this article explores three critical strategies to keep AI firmly on our side. AI researchers can draw lessons from cybersecurity, robotics, and astrobiology. Source: IEEE Spectrum, April 2025, “3 Ways to Keep AI on Our Side: AI Researchers Can Draw Lessons from Cybersecurity, Robotics, and Astrobiology.”




Abstract: This article presents three distinct, cross-disciplinary strategies for ensuring the safe and beneficial development of Artificial Intelligence.

Addressing Idiosyncratic AI Error Patterns (Cybersecurity Perspective): Bruce Schneier and Nathan E. Sanders highlight that AI systems, particularly Large Language Models (LLMs), exhibit error patterns significantly different from human mistakes—being less predictable, not clustered around knowledge gaps, and lacking self-awareness of error. They propose a dual research thrust: engineering AIs to produce more human-intelligible errors (e.g., through refined alignment techniques like RLHF) and developing novel security and mistake-correction systems specifically designed for AI’s unique “weirdness” (e.g., iterative, varied prompting).

Updating Ethical Frameworks to Combat AI Deception (Robotics & Internet Culture Perspective): Dariusz Jemielniak argues that Isaac Asimov’s traditional Three Laws of Robotics are insufficient for modern AI due to the rise of AI-enabled deception, including deepfakes, sophisticated misinformation campaigns, and manipulative AI interactions. He proposes a “Fourth Law of Robotics”: A robot or AI must not deceive a human being by impersonating a human being. Implementing this law would necessitate mandatory AI disclosure, clear labeling of AI-generated content, technical identification standards, legal enforcement, and public AI literacy initiatives to maintain trust in human-AI collaboration.

Establishing Rigorous Protocols for AGI Detection and Interaction (Astrobiology/SETI Perspective): Edmon Begoli and Amir Sadovnik suggest that research into Artificial General Intelligence (AGI) can draw methodological lessons from the Search for Extraterrestrial Intelligence (SETI). They advocate for a structured scientific approach to AGI that includes:

  • Developing clear, multidisciplinary definitions of “general intelligence” and related concepts like consciousness.
  • Creating robust, novel metrics and evaluation benchmarks for detecting AGI, moving beyond limitations of tests like the Turing Test.
  • Formulating internationally recognized post-detection protocols for validation, transparency, safety, and ethical considerations, should AGI emerge.

Collectively, these perspectives emphasize the urgent need for innovative, multi-faceted approaches—spanning security engineering, ethical guideline revision, and rigorous scientific protocol development—to proactively manage the societal integration and potential future trajectory of advanced AI systems.


Here is the full, detailed content:

3 Ways to Keep AI on Our Side

AS ARTIFICIAL INTELLIGENCE reshapes society, our traditional safety nets and ethical frameworks are being put to the test. How can we make sure that AI remains a force for good? Here we bring you three fresh visions for safer AI.

  • In the first essay, security expert Bruce Schneier and data scientist Nathan E. Sanders explore how AI’s “weird” error patterns create a need for innovative security measures that go beyond methods honed on human mistakes.
  • Dariusz Jemielniak, an authority on Internet culture and technology, argues that the classic robot ethics embodied in Isaac Asimov’s famous rules of robotics need an update to counterbalance AI deception and a world of deepfakes.
  • And in the final essay, the AI researchers Edmon Begoli and Amir Sadovnik suggest taking a page from the search for intelligent life in the stars; they propose rigorous standards for detecting the possible emergence of human-level AI intelligence.

As AI advances with breakneck speed, these cross-disciplinary strategies may help us keep our hands on the reins.


AI Mistakes Are Very Different from Human Mistakes

WE NEED NEW SECURITY SYSTEMS DESIGNED TO DEAL WITH THEIR WEIRDNESS

Bruce Schneier & Nathan E. Sanders

HUMANS MAKE MISTAKES all the time. All of us do, every day, in tasks both new and routine. Some of our mistakes are minor, and some are catastrophic. Mistakes can break trust with our friends, lose the confidence of our bosses, and sometimes be the difference between life and death.

Over the millennia, we have created security systems to deal with the sorts of mistakes humans commonly make. These days, casinos rotate their dealers regularly, because they make mistakes if they do the same task for too long. Hospital personnel write on patients’ limbs before surgery so that doctors operate on the correct body part, and they count surgical instruments to make sure none are left inside the body. From copyediting to double-entry bookkeeping to appellate courts, we humans have gotten really good at preventing and correcting human mistakes.

Humanity is now rapidly integrating a wholly different kind of mistakemaker into society: AI. Technologies like large language models (LLMs) can perform many cognitive tasks traditionally fulfilled by humans, but they make plenty of mistakes. You may have heard about chatbots telling people to eat rocks or add glue to pizza. What differentiates AI systems’ mistakes from human mistakes is their weirdness. That is, AI systems do not make mistakes in the same ways that humans do.

Much of the risk associated with our use of AI arises from that difference. We need to invent new security systems that adapt to these differences and prevent harm from AI mistakes.

IT’S FAIRLY EASY to guess when and where humans will make mistakes. Human errors tend to come at the edges of someone’s knowledge: Most of us would make mistakes solving calculus problems. We expect human mistakes to be clustered: A single calculus mistake is likely to be accompanied by others. We expect mistakes to wax and wane depending on factors such as fatigue and distraction. And mistakes are typically accompanied by ignorance: Someone who makes calculus mistakes is also likely to respond “I don’t know” to calculus-related questions.

To the extent that AI systems make these humanlike mistakes, we can bring all of our mistake-correcting systems to bear on their output. But the current crop of AI models—particularly LLMs—make mistakes differently.

AI errors come at seemingly random times, without any clustering around particular topics. The mistakes tend to be more evenly distributed through the knowledge space; an LLM might be equally likely to make a mistake on a calculus question as it is to propose that cabbages eat goats. And AI mistakes aren’t accompanied by ignorance. An LLM will be just as confident when saying something completely and obviously wrong as it will be when saying something true.

The inconsistency of LLMs makes it hard to trust their reasoning in complex, multistep problems. If you want to use an AI model to help with a business problem, it’s not enough to check that it understands what factors make a product profitable; you need to be sure it won’t forget what money is.

THIS SITUATION INDICATES two possible areas of research: engineering LLMs to make mistakes that are more humanlike, and building new mistake-correcting systems that deal with the specific sorts of mistakes that LLMs tend to make.

We already have some tools to lead LLMs to act more like humans. Many of these arise from the field of “alignment” research, which aims to make models act in accordance with the goals of their human developers. One example is the technique that was arguably responsible for the breakthrough success of ChatGPT: reinforcement learning with human feedback. In this method, an AI model is rewarded for producing responses that get a thumbs-up from human evaluators. Similar approaches could be used to induce AI systems to make humanlike mistakes, particularly by penalizing them more for mistakes that are less intelligible.
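To make that reward-shaping idea concrete, here is a minimal, hypothetical sketch in Python. The Candidate fields and the scoring rule are invented for illustration; they are not part of any published RLHF pipeline, but they show how a training signal could penalize unintelligible mistakes more heavily than humanlike ones.

```python
# Toy sketch of the reward-shaping idea described above: candidate responses
# are scored by (hypothetical) human-preference and intelligibility raters,
# and unintelligible mistakes are penalized more heavily than humanlike ones.
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    preference: float       # e.g. thumbs-up rate from human evaluators, in [0, 1]
    is_error: bool          # whether evaluators flagged the answer as wrong
    intelligibility: float  # how "humanlike" the mistake is, in [0, 1]

def shaped_reward(c: Candidate, error_penalty: float = 1.0) -> float:
    """Reward = preference score, minus a larger penalty for weird errors."""
    if not c.is_error:
        return c.preference
    # Errors that a human reviewer finds intelligible are penalized less.
    return c.preference - error_penalty * (1.0 - c.intelligibility)

candidates = [
    Candidate("Cabbages eat goats.", preference=0.1, is_error=True, intelligibility=0.0),
    Candidate("I think the answer is 7, but I'm not sure.", preference=0.4, is_error=True, intelligibility=0.9),
]
best = max(candidates, key=shaped_reward)
print(best.text)  # the humanlike mistake is preferred over the weird one
```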

When it comes to catching AI mistakes, some of the systems that we use to prevent human mistakes will help. To an extent, forcing LLMs to double-check their own work can help prevent errors. But LLMs can also confabulate seemingly plausible yet truly ridiculous explanations for their flights from reason.

Other mistake-mitigation systems for AI are unlike anything we use for humans. Because machines can’t get fatigued or frustrated, it can help to ask an LLM the same question repeatedly in slightly different ways and then synthesize its responses. Humans won’t put up with that kind of annoying repetition, but machines will.
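As a rough illustration of this mitigation, here is a self-contained sketch; ask_llm is a hypothetical stand-in for a real model call, stubbed with canned answers so the example runs on its own.

```python
# Minimal sketch of the "ask the same question several ways" mitigation.
from collections import Counter

def ask_llm(prompt: str) -> str:
    # Simulated, slightly inconsistent model: the answer depends on phrasing.
    return "Lyon" if "Name" in prompt else "Paris"

def ask_with_variations(paraphrases: list[str]) -> str:
    answers = [ask_llm(p) for p in paraphrases]
    # Synthesize by majority vote; real systems might use a judge model instead.
    return Counter(answers).most_common(1)[0][0]

paraphrases = [
    "What is the capital of France?",
    "Which city is France's capital?",
    "Name the capital city of France.",
]
print(ask_with_variations(paraphrases))  # "Paris": the outlier answer is voted out
```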

RESEARCHERS ARE still struggling to understand where LLM mistakes diverge from human ones. Some of the weirdness of AI is actually more humanlike than it first appears.

Small changes to a query to an LLM can result in wildly different responses, a problem known as prompt sensitivity. But, as any survey researcher can tell you, humans behave this way, too. The phrasing of a question in an opinion poll can have drastic impacts on the answers.

LLMs also seem to have a bias toward repeating the words that were most common in their training data—for example, guessing familiar place names like “America” even when asked about more exotic locations. Perhaps this is an example of the human “availability heuristic” manifesting in LLMs; like humans, the machines spit out the first thing that comes to mind rather than reasoning through the question. Also like humans, perhaps, some LLMs seem to get distracted in the middle of long documents; they remember more facts from the beginning and end.

In some cases, what’s bizarre about LLMs is that they act more like humans than we think they should. Some researchers have tested the hypothesis that LLMs perform better when offered a cash reward or threatened with death. It also turns out that some of the best ways to “jailbreak” LLMs (getting them to disobey their creators’ explicit instructions) look a lot like the kinds of social-engineering tricks that humans use on each other: for example, pretending to be someone else or saying that the request is just a joke. But other effective jailbreaking techniques are things no human would ever fall for. One group found that if they used ASCII art (constructions of symbols that look like words or pictures) to pose dangerous questions, like how to build a bomb, the LLM would answer them willingly.

Humans may occasionally make seemingly random, incomprehensible, and inconsistent mistakes, but such occurrences are rare and often indicative of more serious problems. We also tend not to put people exhibiting these behaviors in decision-making positions. Likewise, we should confine AI decision-making systems to applications that suit their actual abilities—while keeping the potential ramifications of their mistakes firmly in mind.


Asimov’s Laws of Robotics Need an Update for AI

PROPOSING A FOURTH LAW OF ROBOTICS

Dariusz Jemielniak

IN 1942, the legendary science fiction author Isaac Asimov introduced his Three Laws of Robotics in his short story “Runaround.” The laws were later popularized in his seminal story collection I, Robot.

  1. FIRST LAW: A robot may not injure a human being or, through inaction, allow a human being to come to harm.
  2. SECOND LAW: A robot must obey the orders given it by human beings except where such orders would conflict with the First Law.
  3. THIRD LAW: A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.

While drawn from works of fiction, these laws have shaped discussions of robot ethics for decades. And as AI systems—which can be considered virtual robots—have become more sophisticated and pervasive, some technologists have found Asimov’s framework useful for considering the potential safeguards needed for AI that interacts with humans.

But the existing three laws are not enough. Today, we are entering an era of unprecedented human-AI collaboration that Asimov could hardly have envisioned. The rapid advancement of generative AI, particularly in language and image generation, has created challenges beyond Asimov’s original concerns about physical harm and obedience.

THE PROLIFERATION of AI-enabled deception is particularly concerning. According to the FBI’s most recent Internet Crime Report, cybercrime involving digital manipulation and social engineering results in annual losses counted in the billions. The European Union Agency for Cybersecurity’s ENISA Threat Landscape 2023 highlighted deepfakes—synthetic media that appear genuine—as an emerging threat to digital identity and trust.

Social-media misinformation is a huge problem today. I studied it extensively during the pandemic and can say that the proliferation of generative AI tools has made its detection increasingly difficult. AI-generated propaganda is often just as persuasive as or even more persuasive than traditional propaganda, and bad actors can very easily use AI to create convincing content. Deepfakes are on the rise everywhere. Botnets can use AI-generated text, speech, and video to create false perceptions of widespread support for any political issue. Bots are now capable of making phone calls while impersonating people, and AI scam calls imitating familiar voices are increasingly common. Any day now, we can expect a boom in video-call scams based on AI-rendered overlay avatars, allowing scammers to impersonate loved ones and target the most vulnerable populations.

Even more alarmingly, children and teenagers are forming emotional attachments to AI agents, and are sometimes unable to distinguish between interactions with real friends and bots online. Already, there have been suicides attributed to interactions with AI chatbots.

In his 2019 book Human Compatible (Viking), the eminent computer scientist Stuart Russell argues that AI systems’ ability to deceive humans represents a fundamental challenge to social trust. This concern is reflected in recent policy initiatives, most notably the European Union’s AI Act, which includes provisions requiring transparency in AI interactions and transparent disclosure of AI-generated content. In Asimov’s time, people couldn’t have imagined the countless ways in which artificial agents could use online communication tools and avatars to deceive humans.

Therefore, we must make an addition to Asimov’s laws.

FOURTH LAW: A robot or AI must not deceive a human being by impersonating a human being.

WE NEED CLEAR BOUNDARIES. While human-AI collaboration can be constructive, AI deception undermines trust and leads to wasted time, emotional distress, and misuse of resources. Artificial agents must identify themselves to ensure our interactions with them are transparent and productive. AI-generated content should be clearly marked unless it has been significantly edited and adapted by a human.

Implementation of this Fourth Law would require

  • mandatory AI disclosure in direct interactions,
  • clear labeling of AI-generated content,
  • technical standards for AI identification,
  • legal frameworks for enforcement, and
  • educational initiatives to improve AI literacy.

Of course, all this is easier said than done. Enormous research efforts are already underway to find reliable ways to watermark or detect AI-generated text, audio, images, and videos. But creating the transparency I’m calling for is far from a solved problem.
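As a toy illustration of the disclosure and labeling requirements listed above, here is a minimal sketch of a machine-readable disclosure record. The field names are assumptions invented for this example rather than an existing standard; real-world efforts include content-provenance schemes and the watermarking research mentioned above.

```python
# Toy sketch of machine-readable AI disclosure, in the spirit of the Fourth Law.
# The field names are illustrative assumptions, not an existing standard.
import json
from datetime import datetime, timezone

def label_ai_content(text: str, model_name: str, human_edited: bool) -> str:
    """Wrap content with a disclosure record stating whether it is AI-generated."""
    record = {
        "content": text,
        "ai_generated": True,
        "model": model_name,
        "significantly_human_edited": human_edited,
        "generated_at": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(record)

def requires_disclosure(record_json: str) -> bool:
    record = json.loads(record_json)
    # Per the proposal: label unless a human significantly edited and adapted it.
    return record["ai_generated"] and not record["significantly_human_edited"]

labeled = label_ai_content("Draft reply to the customer...", "example-model", human_edited=False)
print(requires_disclosure(labeled))  # True -> must be presented to the user as AI-generated
```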

The future of human-AI collaboration depends on maintaining clear distinctions between human and artificial agents. As noted in the IEEE report Ethically Aligned Design, transparency in AI systems is fundamental to building public trust and ensuring the responsible development of artificial intelligence.

Asimov’s complex stories showed that even robots that tried to follow the rules often discovered there were unintended consequences to their actions. Still, having AI systems that are at least trying to follow Asimov’s ethical guidelines would be a very good start.


What Can AI Researchers Learn from Alien Hunters?

THE SETI INSTITUTE’S APPROACH HAS LESSONS FOR RESEARCH ON ARTIFICIAL GENERAL INTELLIGENCE

Edmon Begoli & Amir Sadovnik

THE EMERGENCE OF artificial general intelligence (systems that can perform any intellectual task a human can) could be the most important event in human history. Yet AGI remains an elusive and controversial concept. We lack a clear definition of what it is, we don’t know how to detect it, and we don’t know how to interact with it if it finally emerges.

What we do know is that today’s approaches to studying AGI are not nearly rigorous enough. Companies like OpenAI are actively striving to create AGI, but they include research on AGI’s social dimensions and safety issues only as their corporate leaders see fit. And academic institutions don’t have the resources for significant efforts.

We need a structured scientific approach to prepare for AGI. A useful model comes from an unexpected field: the search for extraterrestrial intelligence, or SETI. We believe that the SETI Institute’s work provides a rigorous framework for detecting and interpreting signs of intelligent life.

The idea behind SETI goes back to the beginning of the space age. In their 1959 Nature paper, the physicists Giuseppe Cocconi and Philip Morrison suggested ways to search for interstellar communication. Given the uncertainty of extraterrestrial civilizations’ existence and sophistication, they theorized about how we should best “listen” for messages from alien societies.

We argue for a similar approach to studying AGI, in all its uncertainties. The last few years have shown a vast leap in AI capabilities. The large language models (LLMs) that power chatbots like ChatGPT and enable them to converse convincingly with humans have renewed the discussion of AGI. One notable 2023 preprint even argued that GPT-4 shows “sparks” of AGI, and today’s most cutting-edge language models are capable of sophisticated reasoning and outperform humans in many evaluations.

While these claims are intriguing, there are reasons to be skeptical. In fact, a large group of scientists have argued that the current set of tools won’t bring us any closer to true AGI. But given the risks associated with AGI, if there is even a small likelihood of it occurring, we must make a serious effort to develop a standard definition of AGI, establish a SETI-like approach to detecting it, and devise ways to safely interact with it if it emerges.

THE CRUCIAL FIRST step is to define what exactly to look for. In SETI’s case, researchers decided to look for certain narrowband signals that would be distinct from other radio signals present in the cosmic background. These signals are considered intentional and only produced by intelligent life. None have been found so far.

In the case of AGI, matters are far more complicated. Today, there is no clear definition of artificial general intelligence. The term is hard to define because it contains other imprecise and controversial terms. Although intelligence has been defined by the Oxford English Dictionary as “the ability to acquire and apply knowledge and skills,” there is still much debate on which skills are involved and how they can be measured. The term general is also ambiguous. Does an AGI need to be able to do absolutely everything a human can do?

One of the first missions of a “SETI for AGI” project must be to clearly define the terms general and intelligence so the research community can speak about them concretely and consistently. These definitions need to be grounded in disciplines such as computer science, measurement science, neuroscience, psychology, mathematics, engineering, and philosophy.

There’s also the crucial question of whether a true AGI must include consciousness and self-awareness. These terms also have multiple definitions, and the relationships between them and intelligence must be clarified. Although it’s generally thought that consciousness isn’t necessary for intelligence, it’s often intertwined with discussions of AGI because creating a self-aware machine would have many philosophical, societal, and legal implications.

NEXT COMES the task of measurement. In the case of SETI, if a candidate narrowband signal is detected, an expert group will verify that it is indeed from an extraterrestrial source. They’ll use established criteria—for example, looking at the signal type and checking for repetition—and conduct assessments at multiple facilities for additional validation.

How to best measure computer intelligence has been a long-standing question in the field. In a famous 1950 paper, Alan Turing proposed the “imitation game,” more widely known as the Turing Test, which assesses whether human interlocutors can distinguish if they are chatting with a human or a machine. Although the Turing Test was useful in the past, the rise of LLMs has made clear that it isn’t a complete enough test to measure intelligence. As Turing himself noted, the relationship between imitating language and thinking is still an open question.

Future appraisals must be directed at different dimensions of intelligence. Although measures of human intelligence are controversial, IQ tests can provide an initial baseline to assess one dimension. In addition, cognitive tests on topics such as creative problem-solving, rapid learning and adaptation, reasoning, and goal-directed behavior would be required to assess general intelligence.
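As a hypothetical sketch of what such a multidimensional assessment could look like in practice, the snippet below aggregates per-dimension scores and flags a system for expert review only when every dimension clears its threshold. The dimension names, thresholds, and scores are invented for illustration and are not an established AGI benchmark.

```python
# Hypothetical sketch of aggregating evaluations across several dimensions of
# intelligence. Dimension names, thresholds, and scores are invented.
DIMENSIONS = {
    "reasoning": 0.9,
    "creative_problem_solving": 0.9,
    "rapid_learning_and_adaptation": 0.9,
    "goal_directed_behavior": 0.9,
}

def flag_for_review(scores: dict[str, float], thresholds: dict[str, float] = DIMENSIONS) -> bool:
    """Flag a system for expert, multi-site review only if it clears *every*
    dimension-specific threshold; passing a single test (e.g. an imitation
    game) is deliberately not enough."""
    return all(scores.get(dim, 0.0) >= t for dim, t in thresholds.items())

candidate_scores = {
    "reasoning": 0.95,
    "creative_problem_solving": 0.70,  # strong but uneven profile
    "rapid_learning_and_adaptation": 0.92,
    "goal_directed_behavior": 0.88,
}
print(flag_for_review(candidate_scores))  # False: no detection event triggered
```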

But it’s important to remember that these cognitive tests were designed for humans and might contain assumptions that might not apply to computers, even those with AGI abilities. For example, depending on how it’s trained, a machine may score very high on an IQ test but remain unable to solve much simpler tasks. In addition, an AI may have new abilities that aren’t measurable by our traditional tests. There’s a clear need to design novel evaluations that can alert us when meaningful progress is made toward AGI.

IF WE DEVELOP AGI, we must be prepared to answer questions such as: Is the new form of intelligence a new form of life? What kinds of rights does it have? What are the potential safety concerns, and what is our approach to containing the AGI entity?

Here, too, SETI provides inspiration. SETI’s postdetection protocols emphasize validation, transparency, and international cooperation, with the goal of maximizing the credibility of the process, minimizing sensationalism, and bringing structure to such a profound event. Likewise, we need internationally recognized AGI protocols to bring transparency to the entire process, apply safety-related best practices, and begin the discussion of ethical, social, and philosophical concerns.

We readily acknowledge that the SETI analogy can go only so far. If AGI emerges, it will be a human-made phenomenon. We will likely gradually engineer AGI and see it slowly emerge, so detection might be a process that takes place over a period of years, if not decades. In contrast, the existence of extraterrestrial life is something that we have no control over, and contact could happen very suddenly.

The consequences of a true AGI are entirely unpredictable. To best prepare, we need a methodical approach to defining, detecting, and interacting with AGI, which could be the most important development in human history.


2024 Guest Lecture Notes: AI, Machine Learning and Data Mining in Recommendation System and Entity Matching

  1. Lecture Notes Repository on GitHub
    1. Disclaimer
    2. 2024-10-14: AI/ML in Action for CSE5ML
    3. 2024-10-15: AI/DM in Action for CSE5DMI
  2. Contribution to the Company and Society
  3. Reference

In October 2024, I was invited by Dr Lydia C. and Dr Peng C to give two presentations as a guest lecturer at La Trobe University (Melbourne) to students enrolled in CSE5DMI Data Mining and CSE5ML Machine Learning.

The lectures focused on data mining and machine learning applications and practice in industry and digital retail, and on how students should prepare themselves for their future. Attendees were postgraduate students enrolled in CSE5ML or CSE5DMI in 2024 Semester 2, approximately 150 students for each subject, pursuing one of the following degrees:

  • Master of Information Technology (IT)
  • Master of Artificial Intelligence (AI)
  • Master of Data Science
  • Master of Business Analytics

Lecture Notes Repository on GitHub

Viewers can find the lecture notes in my GitHub repository at https://github.com/cuicaihao/GuestLecturePublic, released under a Creative Commons Attribution 4.0 International License.

Disclaimer

This repository is intended for educational purposes only. The content, including presentations and case studies, is provided “as is” without any warranties or guarantees of any kind. The authors and contributors are not responsible for any errors or omissions, or for any outcomes related to the use of this material. Use the information at your own risk. All trademarks, service marks, and company names are the property of their respective owners. The inclusion of any company or product names does not imply endorsement by the authors or contributors.

This is a public repository intended to share the lectures with the public. The *.excalidraw files can be downloaded and opened at https://excalidraw.com/.

2024-10-14: AI/ML in Action for CSE5ML

  • General Slides CSE5ML
  • Case Study: Recommendation System
  • A recommendation system is an artificial intelligence (AI) algorithm, usually associated with machine learning, that uses big data to suggest or recommend additional products to consumers. These suggestions can be based on various criteria, including past purchases, search history, demographic information, and other factors. (A minimal sketch follows this list.)
  • This presentation was developed for students of CSE5ML at La Trobe University, Melbourne, and was used in the guest lecture on 14 October 2024.
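For readers who want a concrete, minimal illustration of the recommendation idea above, here is a small item-based collaborative-filtering sketch. It is a simplified toy example with made-up interaction data, not the system presented in the lecture slides.

```python
# Minimal, self-contained sketch of an item-based collaborative-filtering recommender.
import numpy as np

# Rows = users, columns = items; entries are purchase/interaction counts.
interactions = np.array([
    [1, 1, 0, 0],   # user 0 bought items 0 and 1
    [0, 1, 1, 0],   # user 1 bought items 1 and 2
    [1, 0, 0, 1],   # user 2 bought items 0 and 3
], dtype=float)

# Cosine similarity between item columns.
norms = np.linalg.norm(interactions, axis=0, keepdims=True)
item_sim = (interactions.T @ interactions) / (norms.T @ norms + 1e-9)

def recommend(user_id: int, top_k: int = 2) -> list[int]:
    """Score unseen items by their similarity to the user's past items."""
    seen = interactions[user_id] > 0
    scores = item_sim @ interactions[user_id]
    scores[seen] = -np.inf                      # do not re-recommend seen items
    return [int(i) for i in np.argsort(scores)[::-1][:top_k]]

print(recommend(0))  # items 2 and 3: each co-occurs with one of user 0's past items
```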

2024-10-15: AI/DM in Action for CSE5DMI

  • General Slides CSE5DMI
  • Case Study: Entity Matching System
    • Entity matching is the task of clustering duplicated database records to their underlying entities: “Given a large collection of records, cluster these records so that the records in each cluster all refer to the same underlying entity.” (A toy sketch follows this list.)
  • This presentation was developed for students of CSE5DMI at La Trobe University, Melbourne, and was used in the guest lecture on 15 October 2024.
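For a concrete, minimal illustration of the entity-matching task defined above, here is a toy sketch that compares record pairs with a simple string-similarity measure and clusters matches with union-find. The records and threshold are invented; production systems use blocking, learned matchers, and far more careful similarity functions.

```python
# Toy entity matching: compare record pairs with string similarity, then
# cluster matching records with union-find.
from difflib import SequenceMatcher

records = [
    "Apple Inc., Cupertino CA",
    "APPLE INC, Cupertino, California",
    "Microsoft Corp., Redmond WA",
    "Microsoft Corporation, Redmond",
]

def similar(a: str, b: str, threshold: float = 0.6) -> bool:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

parent = list(range(len(records)))

def find(i: int) -> int:            # union-find with path compression
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def union(i: int, j: int) -> None:
    parent[find(i)] = find(j)

for i in range(len(records)):
    for j in range(i + 1, len(records)):
        if similar(records[i], records[j]):
            union(i, j)

clusters = {}
for i, r in enumerate(records):
    clusters.setdefault(find(i), []).append(r)
print(list(clusters.values()))      # two clusters: the Apple records and the Microsoft records
```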

Contribution to the Company and Society

This journey also aligns with the Company’s strategy.

  • Being invited to be a guest lecturer for students with related knowledge backgrounds in 2024 aligns closely with EDG’s core values of “we’re real, we’re inclusive, we’re responsible”.
  • By participating in a guest lecture and discussion on data analytics and AI/ML practice beyond theories, we demonstrate our commitment to sharing knowledge and expertise, embodying our responsibility to contribute positively to the academic community and bridge the gap between theory builders and problem solvers.
  • This event allows us to inspire and educate students in the same domains at La Trobe University, showcasing our passion and enthusiasm for the business. Through this engagement, we aim to positively impact attendees, providing suggestions for their career paths, and fostering a spirit of collaboration and continuous learning.
  • Showing our purpose, values, and ways of working will impress future graduates who may want to come and work for us, want to stay and thrive with us. It also helps us deliver on our purpose to create a more sociable future, together.

Moreover, I am grateful for all the support and encouragement I have received from my university friends and teammates throughout this journey. Additionally, the teaching resources and environment in the West Lecture Theatres at La Trobe University are outstanding!

Reference

-END-

AI Revolutionizes Industry and Retail: From Production Lines to Personalized Shopping Experiences

  1. Industry and Retail Relationship
  2. AI in Industry
  3. AI in Retail
  4. Summary

AI technology is increasingly being utilized in the industry and retail sectors to enhance efficiency, productivity, and customer experiences. In this post, we first revisit the relationship between the industry and retail sectors, then survey some common AI technologies and applications used in these domains.

Industry and Retail Relationship

The key difference between industry and retail lies in their primary functions and the nature of their operations:

Industry:

  • Industry, often referred to as manufacturing or production, involves the creation, extraction, or processing of raw materials and the transformation of these materials into finished goods or products.
  • Industrial businesses are typically involved in activities like manufacturing, mining, construction, or agriculture.
  • The primary focus of the industry is to produce goods on a large scale, which are then sold to other businesses, wholesalers, or retailers. These goods are often used as inputs for other industries or for further processing.
  • Industries may have complex production processes, rely on machinery and technology, and require substantial capital investment.

Retail:

  • Retail, on the other hand, involves the sale of finished products or goods directly to the end consumers for personal use. Retailers act as intermediaries between manufacturers or wholesalers and the end customers.
  • Retailers can take various forms, including physical stores, e-commerce websites, supermarkets, boutiques, and more.
  • Retailers may carry a wide range of products, including those manufactured by various industries. They focus on providing a convenient and accessible point of purchase for consumers.
  • Retail operations are primarily concerned with merchandising, marketing, customer service, inventory management, and creating a satisfying shopping experience for consumers.

AI in Industry

AI, or artificial intelligence, is revolutionizing industry sectors by powering various applications and technologies that enhance efficiency, productivity, and customer experiences. Here are some common AI technologies and applications used in these domains:

1. Robotics and Automation: AI-driven robots and automation systems are used in manufacturing to perform repetitive, high-precision tasks, such as assembly, welding, and quality control. Machine learning algorithms enable these robots to adapt and improve their performance over time.

2. Predictive Maintenance: AI is used to predict when industrial equipment, such as machinery or vehicles, is likely to fail. This allows companies to schedule maintenance proactively, reducing downtime and maintenance costs (see the sketch after this list).

3. Quality Control: Computer vision and machine learning algorithms are employed for quality control processes. They can quickly identify defects or irregularities in products, reducing the number of faulty items reaching the market.

4. Supply Chain Optimization: AI helps in optimizing the supply chain by predicting demand, managing inventory, and optimizing routes for logistics and transportation.

5. Process Optimization: AI can optimize manufacturing processes by adjusting parameters in real time to increase efficiency and reduce energy consumption.

6. Safety and Compliance: AI-driven systems can monitor and enhance workplace safety, ensuring that industrial facilities comply with regulations and safety standards.
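To make the predictive-maintenance item (item 2) a bit more concrete, here is a toy sketch that flags equipment for proactive servicing when recent sensor readings drift far from a historical baseline. The data and thresholds are invented; real systems use richer features and learned failure models.

```python
# Toy predictive-maintenance sketch: flag equipment for proactive servicing
# when recent sensor readings drift far from the machine's healthy baseline.
import numpy as np

baseline = np.random.default_rng(0).normal(loc=1.0, scale=0.05, size=500)  # healthy vibration levels
recent = np.array([1.02, 1.05, 1.21, 1.30, 1.35])                          # latest readings

def needs_maintenance(recent, baseline, z_threshold=3.0, min_hits=2):
    z = (recent - baseline.mean()) / baseline.std()
    return int((z > z_threshold).sum()) >= min_hits    # several abnormal readings in a row

print(needs_maintenance(recent, baseline))  # True -> schedule maintenance before failure
```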


AI in Retail

AI technology is revolutionizing the retail sector too, introducing innovative solutions and transforming the way businesses engage with customers. Here are some key AI technologies and applications used in retail:

1. Personalized Marketing: AI is used to analyze customer data and behaviours to provide personalized product recommendations, targeted marketing campaigns, and customized shopping experiences.

2. Chatbots and Virtual Assistants: Retailers employ AI-powered chatbots and virtual assistants to provide customer support, answer queries, and assist with online shopping.

3. Inventory Management: AI can optimize inventory levels and replenishment by analyzing sales data and demand patterns, reducing stockouts and overstock situations.

4. Price Optimization: Retailers use AI to dynamically adjust prices based on various factors, such as demand, competition, and customer behaviour, to maximize revenue and profits.

5. Visual Search and Image Recognition: AI enables visual search in e-commerce, allowing customers to find products by uploading images or using images they find online.

6. Supply Chain and Logistics: AI helps optimize supply chain operations, route planning, and warehouse management, improving efficiency and reducing costs.

7. In-Store Analytics: AI-powered systems can analyze in-store customer behaviour, enabling retailers to improve store layouts, planogram designs, and customer engagement strategies.

8. Fraud Detection: AI is used to detect and prevent fraudulent activities, such as credit card fraud and return fraud, to protect both retailers and customers.

Summary

AI’s potential to transform industry and retail is huge and its future applications are very promising. As AI technologies advance, we can expect increased levels of automation, personalization, and optimization in industry and retail operations.

AI technologies in these sectors often rely on machine learning (ML), deep learning (DL), natural language processing (NLP), and computer vision (CV), and now Generative Large Language Models (LLM) to analyze and gain insights from data. These AI applications are continuously evolving and are changing the way businesses in these sectors operate, leading to improved processes and customer experiences.

AI will drive high levels of efficiency, innovation, and customer satisfaction in these sectors, ultimately revolutionizing the way businesses operate and interact with consumers.


Technical Review 02: Data and Knowledge for Large Language Model (LLM)

  1. AI Assistant Summary
  2. Introduction
  3. Unveiling the Depth of LLM Knowledge
  4. The Repository of Knowledge: How LLM Stores and Retrieves Information
  5. Knowledge Correction in LLM: Adapting to Evolving Information
  6. Methods for Modifying Knowledge in LLM
    1. Correcting Knowledge at the Source
    2. Fine-Tuning to Correct Knowledge
    3. Directly Modifying Model Parameters
  7. What’s Next

Previous: Technical Review 01: Large Language Model (LLM) and NLP Research Paradigm Transformation

AI Assistant Summary

This blog explores the depth of knowledge acquired by Transformer-based Large Language Models (LLMs). The knowledge obtained can be categorized into linguistic knowledge and factual knowledge. Linguistic knowledge includes understanding language structure and rules, while factual knowledge encompasses real-world events and common-sense notions.

The blog explains that LLMs acquire linguistic knowledge at various levels, with more fundamental language elements residing in lower and mid-level structures, and abstract language knowledge distributed across mid-level and high-level structures. In terms of factual knowledge, LLMs absorb a significant amount of it, mainly in the mid and high levels of the Transformer model.

The blog also addresses how LLMs store and retrieve knowledge. It suggests that the feedforward neural network (FFN) layers in the Transformer serve as a Key-Value memory system, housing specific knowledge. The FFN layers detect knowledge patterns through the Key layer and retrieve corresponding values from the Value layer to generate output.

Furthermore, the blog discusses the feasibility of correcting erroneous or outdated knowledge within LLMs. It introduces three methods for modifying knowledge in LLMs:

  1. Correcting knowledge at the source by identifying and adjusting training data;
  2. Fine-tuning the model with new training data containing desired corrections;
  3. Directly Modifying model parameters associated with specific knowledge.

These methods aim to enhance the reliability and relevance of LLMs in providing up-to-date and accurate information. The blog emphasizes the importance of adapting and correcting knowledge in LLMs to keep pace with evolving information in real-world scenarios.

In conclusion, this blog sheds light on the depth of knowledge acquired by LLMs, how it is stored and retrieved, and strategies for correcting and adapting knowledge within these models. Understanding these aspects contributes to harnessing the full potential of LLMs in various applications.

Introduction

Judging from the current Large Language Model (LLM) research results, the Transformer is a sufficiently powerful feature extractor and does not require special improvements.

So what did Transformer learn through the pre-training process?

How is knowledge accessed?

How do we correct incorrect knowledge?

This blog discusses the research progress in this area.

Unveiling the Depth of LLM Knowledge

Large Language Models (LLM) acquire a wealth of knowledge from extensive collections of free text. This knowledge can be broadly categorized into two realms: linguistic knowledge and factual knowledge.

Linguistic Knowledge

  • This encompasses understanding the structure and rules of language, including morphology, parts of speech, syntax, and semantics. Extensive research has affirmed the capacity of LLM to grasp various levels of linguistic knowledge.
  • Since the emergence of models like Bert, numerous experiments have validated this capability. The acquisition of such linguistic knowledge is pivotal, as it substantially enhances LLM’s performance in various natural language understanding tasks following pre-training.
  • Additionally, investigations have shown that more fundamental language elements like morphology, parts of speech, and syntax reside in the lower and mid-level structures of the Transformer, while abstract language knowledge, such as semantics, is distributed across the mid-level and high-level structures.

Factual Knowledge

  • This category encompasses both factual knowledge, relating to real-world events, and common-sense knowledge.
  • Examples include facts like “Biden is the current President of the United States” and common-sense notions like “People have two eyes.”
  • Numerous studies have explored the extent to which LLM models can absorb world knowledge, and the consensus suggests that they indeed acquire a substantial amount of it from their training data.
  • This knowledge tends to be concentrated primarily in the mid and high levels of the Transformer model. Notably, as the depth of the Transformer model increases, its capacity to learn and retain knowledge expands exponentially. LLM can be likened to an implicit knowledge graph stored within its model parameters.

A study titled “When Do You Need Billions of Words of Pre-training Data?” delves into the relationship between the volume of pre-training data and the knowledge acquired by the model.

The conclusion drawn is that for Bert-type language models, a corpus containing 10 million to 100 million words suffices to learn linguistic knowledge, including syntax and semantics. However, to grasp factual knowledge, a more substantial volume of training data is required.

This is logical, given that linguistic knowledge is relatively finite and static, while factual knowledge is vast and constantly evolving. Current research demonstrates that as the amount of training data increases, the pre-trained model exhibits enhanced performance across a range of downstream tasks, emphasizing that the incremental data mainly contributes to the acquisition of world knowledge.

The Repository of Knowledge: How LLM Stores and Retrieves Information

As discussed earlier, Large Language Models (LLM) accumulate an extensive reservoir of language and world knowledge from their training data. But where exactly is this knowledge stored within the model, and how does LLM access it? These are intriguing questions worth exploring.

Evidently, this knowledge is stored within the model parameters of the Transformer architecture. The model parameters can be divided into two primary components: the multi-head attention (MHA) segment, which constitutes roughly one-third of the total parameters, and the remaining two-thirds of the parameters are concentrated in the feedforward neural network (FFN) structure.
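A quick back-of-the-envelope check makes this split plausible. Assuming a standard Transformer block whose FFN inner dimension is four times the model dimension, and ignoring embeddings, biases, and layer norms:

```python
# Rough parameter count for one standard Transformer block of width d_model,
# ignoring embeddings, biases, and layer norms.
d_model = 768                      # e.g. a BERT-base / GPT-2-small sized layer
mha = 4 * d_model * d_model        # W_Q, W_K, W_V and the output projection W_O
ffn = 2 * d_model * (4 * d_model)  # up-projection d -> 4d and down-projection 4d -> d
total = mha + ffn
print(mha / total, ffn / total)    # ~0.33 vs ~0.67: the FFN holds about two-thirds
```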

The MHA component primarily serves to gauge the relationships and connections between words or pieces of knowledge, facilitating the integration of global information. It’s more geared toward establishing contextual connections rather than storing specific knowledge points. Therefore, it’s reasonable to infer that the substantial knowledge base of the LLM model is primarily housed within the FFN structure of the Transformer.

However, the granularity of such positioning is still too coarse, and it is difficult to answer how a specific piece of knowledge is stored and retrieved.

A relatively novel perspective, introduced in the article “Transformer Feed-Forward Layers Are Key-Value Memories,” suggests that the feedforward neural network (FFN) layers in the Transformer architecture function as a Key-Value memory system storing a wealth of specific knowledge. The figure below illustrates this concept, with annotations on the right side for improved clarity (since the original paper’s figure on the left side can be somewhat challenging to grasp).

In this Key-Value memory framework, the first layer of the FFN serves as the Key layer, characterized by a wide hidden dimension, while the second layer, which projects back down to the narrower model dimension, serves as the Value layer. The input to the FFN corresponds to the output embedding generated by the Multi-Head Attention (MHA) mechanism for a specific word, encapsulating the comprehensive context information drawn from the entire input sentence via self-attention.

Each neuron in the Key layer effectively stores a key-value pair of information. For instance, a node in the first hidden layer of the FFN may record a specific piece of factual knowledge, such as the fact that Beijing is the capital of China (the example worked through below).

The Key vector associated with a node essentially acts as a pattern detector, aiming to identify specific language or knowledge patterns within the input. The inner product between the input vector and the node’s key weights is computed and passed through the Rectified Linear Unit (ReLU) activation function; a large activation signals that the pattern has been detected. The resulting response value is then propagated to the second FFN layer through the node’s Value weights.

In essence, the FFN’s forward propagation process resembles the detection of a specific knowledge pattern using the Key, retrieving the corresponding Value, and incorporating it into the second FFN layer’s output. As each node in the second FFN layer aggregates information from all nodes in the Key layer, it generates a mixed response, with the collective response across all nodes in the Value layer serving as probability distribution information for the output word.

It may still sound complicated, so let’s use an extreme example to illustrate. Assume that one node is the Key-Value memory that records this piece of knowledge: its Key vector detects the knowledge pattern “The capital of China is…”, and its Value vector stores an embedding close to that of the word “Beijing”. When the input to the Transformer is “The capital of China is [Mask]”, the node detects this knowledge pattern in the input and generates a large response. If we assume that the other neurons in the Key layer produce no response to this input, then the Value layer effectively receives only the word embedding corresponding to “Beijing”, amplified by the large response value. The output at the [Mask] position will therefore naturally be the word “Beijing”. The process looks complicated, but it is actually quite simple.
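To make the mechanism concrete, here is a toy numpy sketch of this Key-Value reading of the FFN forward pass. The dimensions and weights are hand-crafted so that one “key” neuron fires on the “The capital of China is…” pattern and its “value” row points toward a pretend “Beijing” embedding; it illustrates the interpretation, not an actual trained model.

```python
# Toy numpy sketch of the Key-Value reading of an FFN forward pass.
import numpy as np

d_model, d_ffn = 4, 3
# Pretend embedding: the "Beijing" direction in our tiny vocabulary space.
beijing = np.array([0.0, 0.0, 1.0, 0.0])

# Key layer: each row is a pattern detector over the MHA output for this token.
W_key = np.array([
    [1.0, 1.0, 0.0, 0.0],   # neuron 0: fires on "The capital of China is ..."
    [0.0, 0.0, 0.0, 1.0],   # neuron 1: some unrelated pattern
    [0.0, 1.0, 0.0, 1.0],   # neuron 2: some unrelated pattern
])
# Value layer: row i is the vector written out when key neuron i fires.
W_value = np.vstack([beijing, np.zeros(4), np.zeros(4)])

def ffn(x):
    h = np.maximum(W_key @ x, 0.0)   # inner products + ReLU: pattern detection
    return h @ W_value               # weighted sum of value vectors: retrieval

context = np.array([1.0, 1.0, 0.0, 0.0])   # MHA output for the [Mask] position
print(ffn(context))                         # a multiple of the "Beijing" embedding direction
```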

Moreover, the article also pointed out that the lower Transformer layers respond to the surface patterns of sentences, while the higher layers respond to semantic patterns. That is to say, the low-level FFN layers store surface knowledge such as lexicon and syntax, while the middle and high-level layers store semantic and factual concept knowledge. This is consistent with the conclusions of other research.

I would guess that the idea of treating the FFN as a Key-Value memory is probably not the final correct answer, but it is probably not far from it.

Knowledge Correction in LLM: Adapting to Evolving Information

As we’ve established that specific pieces of factual knowledge reside in the parameters of one or more feedforward neural network (FFN) nodes within the Large Language Models (LLM), it’s only natural to ponder the feasibility of correcting erroneous or outdated knowledge stored within these models.

Let’s consider an example to illustrate this point. If you were to ask, “Who is the current Prime Minister of the United Kingdom?” in a dynamic political landscape where British Prime Ministers frequently change, would LLM tend to produce “Boris” or “Sunak” as the answer?

In such a scenario, the model is likely to encounter a higher volume of training data containing “Boris.” Consequently, there’s a considerable risk that LLM could provide an incorrect response. Therefore, there arises a compelling need to address the issue of correcting outdated or erroneous knowledge stored within the LLM.

By exploring strategies to rectify knowledge within LLM and adapting it to reflect real-time developments and evolving information, we take a step closer to harnessing the full potential of these models in providing up-to-date and accurate answers to questions that involve constantly changing facts or details. This endeavour forms an integral part of enhancing the reliability and relevance of LLM in practical applications.

Methods for Modifying Knowledge in LLM

Currently, there are three distinctive approaches for modifying knowledge within Large Language Models (LLMs):

Correcting Knowledge at the Source

This method aims to rectify knowledge errors by identifying the specific training data responsible for the erroneous knowledge in the LLM. With advancements in research, it’s possible to trace back the source data that led the LLM to learn a particular piece of knowledge. In practical terms, this means we can identify the training data associated with a specific knowledge item, allowing us to potentially delete or amend the relevant data source. However, this approach has limitations, particularly when dealing with minor knowledge corrections. The need for retraining the entire model to implement even small adjustments can be prohibitively costly. Therefore, this method is better suited for large-scale data removal, such as addressing bias or eliminating toxic content.

Fine-Tuning to Correct Knowledge

This approach involves constructing new training data containing the desired knowledge corrections. The LLM model is then fine-tuned on this data, guiding it to remember new knowledge and forget old knowledge.

While straightforward, it presents challenges, such as the issue of “catastrophic forgetting,” where fine-tuning leads the model to forget not only the targeted knowledge but also other essential knowledge. Given the large size of current LLMs, frequent fine-tuning can be computationally expensive.
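A minimal sketch of this fine-tuning approach, using Hugging Face Transformers with GPT-2 as a small stand-in model, is shown below. The correction text and hyperparameters are illustrative only; as noted above, naive fine-tuning risks catastrophic forgetting, so in practice the correction data is usually mixed with general data or applied to a restricted set of parameters.

```python
# Minimal sketch of "fine-tuning to correct knowledge", using GPT-2 as a small
# stand-in model. The correction text and hyperparameters are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.train()

corrections = ["The current Prime Minister of the United Kingdom is Rishi Sunak."]
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for epoch in range(3):                       # a few passes over the correction data
    for text in corrections:
        batch = tokenizer(text, return_tensors="pt")
        # For causal LM fine-tuning, the labels are the input ids themselves.
        outputs = model(**batch, labels=batch["input_ids"])
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
# Caveat: naive fine-tuning like this risks catastrophic forgetting; mixing in
# general data or updating only a small subset of parameters limits side effects.
```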

Directly Modifying Model Parameters

In this method, knowledge correction is achieved by directly altering the LLM’s model parameters associated with specific knowledge. For instance, if we wish to update the knowledge from “<UK, current Prime Minister, Boris>” to “<UK, current Prime Minister, Sunak>”, we locate the FFN node storing the old knowledge within the LLM parameters.

Subsequently, we forcibly adjust and replace the relevant model parameters within the FFN to reflect the new knowledge. This approach involves two key components: the ability to pinpoint the storage location of knowledge within the LLM parameter space and the capacity to alter model parameters for knowledge correction. Deeper insight into this knowledge revision process contributes to a more profound understanding of LLMs’ internal mechanisms.
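Here is a conceptual numpy sketch of this locate-and-replace idea under the Key-Value view of the FFN discussed in the previous review. Everything in it is a toy stand-in for the weights of a real LLM; published methods such as ROME perform a careful rank-one update rather than the blunt overwrite shown here.

```python
# Conceptual sketch of "locate and replace" knowledge editing under the
# Key-Value view of the FFN. All arrays are toy stand-ins for real LLM weights.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ffn = 8, 16
W_key = rng.normal(size=(d_ffn, d_model))        # FFN first layer: pattern detectors
W_value = rng.normal(size=(d_ffn, d_model))      # FFN second layer: stored "values"

prompt_repr = rng.normal(size=d_model)           # hidden state for "UK's current PM is"
new_fact = rng.normal(size=d_model)              # embedding direction for "Sunak"

def retrieve(values):
    return np.maximum(W_key @ prompt_repr, 0.0) @ values

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# 1. Locate: the key neuron that fires most strongly on the prompt pattern.
activations = np.maximum(W_key @ prompt_repr, 0.0)
target = int(np.argmax(activations))

# 2. Edit: overwrite that neuron's value vector so retrieval emits the new fact.
#    (The write strength is exaggerated so the toy edit clearly dominates.)
W_edited = W_value.copy()
W_edited[target] = (100.0 / activations[target]) * new_fact

print(cosine(retrieve(W_value), new_fact), cosine(retrieve(W_edited), new_fact))
# Similarity to the new fact jumps after the edit: the FFN now "recalls" it.
```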

These methods provide a foundation for adapting and correcting the knowledge within LLMs, ensuring that these models can produce accurate and up-to-date information in response to ever-changing real-world scenarios.


What’s Next

The next blog is Technical Review 03: Scale Effect: What Happens When LLMs Get Bigger and Bigger.

Technical Review 01: Large Language Model (LLM) and NLP Research Paradigm Transformation

TL;DR: This blog explores the profound influence of ChatGPT, awakening various sectors – the general public, academia, and industry – to the developmental philosophy of Large Language Models (LLMs). It delves into OpenAI’s prominent role and analyzes the transformative effect of LLMs on Natural Language Processing (NLP) research paradigms. Additionally, it contemplates future prospects for the ideal LLM.

  1. AI Assistant Summary
  2. Introduction
  3. NLP Research Paradigm Transformation
    1. Paradigm Shift 1.0 (2013): From Deep Learning to Two-Stage Pre-trained Models
      1. Impact 1: The Decline of Intermediate Tasks
      2. Impact 2: Standardization of Technical Approaches Across All Areas
    2. Paradigm Shift 2.0 (2020): Moving from Pre-Trained Models to General Artificial Intelligence (AGI)
  4. The Ideal Large Language Model (LLM)
    1. Impact 1: LLMs Adapting to Human Needs with Natural Interfaces
    2. Impact 2: Many NLP subfields no longer have independent research value
    3. Impact 3: More research fields other than NLP will be included in the LLM technology system
  5. Reference

AI Assistant Summary

This blog discusses the impact of ChatGPT and the awakening it brought to the understanding of Large Language Models (LLM). It emphasizes the importance of the development philosophy behind LLM and notes OpenAI’s leading position, followed by Google, with DeepMind and Meta catching up. The article highlights OpenAI’s contributions to LLM technology and the global hierarchy in this domain.

What is Gen-AI’s superpower?

The blog is divided into two main sections: the NLP research paradigm transformation and the ideal Large Language Model (LLM).

In the NLP research paradigm transformation section, there are two significant paradigm shifts discussed. The first shift, from deep learning to two-stage pre-trained models, marked the introduction of models like Bert and GPT. This shift led to the decline of intermediate tasks in NLP and the standardization of technical approaches across different NLP subfields.

The second paradigm shift focuses on the move from pre-trained models to General Artificial Intelligence (AGI). The blog highlights the impact of ChatGPT in bridging the gap between humans and LLMs, allowing LLMs to adapt to human commands and preferences. It also suggests that many independently existing NLP research fields will be incorporated into the LLM technology system, while other fields outside of NLP will also be included. The ultimate goal is to achieve an ideal LLM that is a domain-independent general artificial intelligence model.

In the section on the ideal Large Language Model (LLM), the blog discusses the characteristics and capabilities of an ideal LLM. It emphasizes the self-learning capabilities of LLMs, the ability to tackle problems across different subfields, and the importance of adapting LLMs to user-friendly interfaces. It also mentions the impact of ChatGPT in integrating human preferences into LLMs and the future potential for LLMs to expand into other fields such as image processing and multimodal tasks.

Overall, the blog provides insights into the impact of ChatGPT, the hierarchy in LLM development, and the future directions for LLM technology.

Introduction

Since the emergence of OpenAI’s ChatGPT, many people and companies in academia and industry have been both surprised and awakened. I was pleasantly surprised because I did not expect a Large Language Model (LLM) to be this effective, and I was also shocked because most of our academic and industrial understanding of LLMs and their development philosophy is far from the world’s most advanced ideas. This blog series covers my reviews, reflections, and thoughts about LLMs.

Since GPT 3.0, the LLM has not been merely a specific technology; it embodies a development philosophy that outlines where LLMs should be heading. From a technical standpoint, I personally believe that the main gap lies in differing understandings of LLMs and their future development philosophy, regardless of the financial resources available to build them.

While many AI-related companies are currently in a “critical stage of survival,” I don’t believe the situation is as dire as it may seem. OpenAI is the only organization with a forward-thinking vision in the world. ChatGPT has demonstrated exceptional performance that has left everyone else trailing behind; even super companies like Google lag behind in their understanding of the LLM development concept and in the maturity of their products.

In the field of LLMs, there is a clear hierarchy. OpenAI is leading internationally, being about six months to a year ahead of Google and DeepMind, and approximately two years ahead of China. Google holds the second position, with technologies like PaLM 1/2, Pathways, and Generative AI on GCP Vertex AI, which align with its technical vision. These were launched between February and April of 2022, around the same time as OpenAI’s InstructGPT 3.5. This highlights the gap between Google and OpenAI.

DeepMind has mainly focused on reinforcement learning for games and AI for science. It started paying attention to LLMs in 2021 and is currently catching up. Meta AI, previously known as Facebook AI, did not prioritize LLMs in the past, but it is now trying to catch up with the recently open-sourced Llama 2. These institutions are currently among the best in the field.

To summarize the mainstream LLM technology, I mainly focus on the Transformer, BERT, GPT, and ChatGPT (up to version 4.0).

NLP Research Paradigm Transformation

Taking a look back at the early days of deep learning in Natural Language Processing (NLP), we can see significant milestones over the past two decades. There have been two major shifts in the technology of NLP.

Paradigm Shift 1.0 (2013): From Deep Learning to Two-Stage Pre-trained Models

The period of this paradigm shift encompasses roughly the time frame from the introduction of deep learning into the field of NLP, around 2013, up until just before the emergence of GPT 3.0, which occurred around May 2020.

Prior to the rise of models like BERT and GPT, the prevailing technology in the NLP field was deep learning. It was primarily reliant on two core technologies:

  1. A plethora of enhanced LSTM models and a smaller number of improved ConvNet models served as typical Feature Extractors.
  2. A prevalent technical framework for various specific tasks was based on Sequence-to-Sequence (or Encoder-Decoder) Architectures coupled with Attention mechanisms.

With these foundational technologies in place, the primary research focus in deep learning for NLP revolved around how to effectively increase model depth and parameter capacity. This involved the continual addition of deeper LSTM or CNN layers to encoders and decoders with the aim of enhancing layer depth and model capacity. Despite these efforts successfully deepening the models, their overall effectiveness in solving specific tasks was somewhat limited. In other words, the advantages gained compared to non-deep learning methods were not particularly significant.

The difficulties that have held back the success of deep learning in NLP can be attributed to two main issues:

  1. Scarcity of Training Data: One significant challenge is the lack of enough training data for specific tasks. As the model becomes more complex, it requires more data to work effectively. This used to be a major problem in NLP research before the introduction of pre-trained models.
  2. Limited Ability of LSTM/CNN Feature Extractors: Another issue is that the feature extractors using LSTM/CNN are not versatile enough. This means that, no matter how much data you have, the model struggles to make good use of it because it can’t effectively capture and utilize the information within the data.

These two factors seem to be the primary obstacles that have prevented deep learning from making significant advancements in the field of NLP.

The advent of two pre-training models, Bert and GPT, marks a significant technological advancement in the field of NLP.

About a year after the introduction of Bert, the technological landscape had essentially consolidated into these two core models.

This development has had a profound impact on both academic research and industrial applications, leading to a complete transformation of the research paradigm in the field. The impact of this paradigm shift can be observed in two key areas:

  • firstly, a decline and, in some cases, the gradual obsolescence of certain NLP research subfields;
  • secondly, the growing standardization of technical methods and frameworks across different NLP subfields.

Impact 1: The Decline of Intermediate Tasks

In the field of NLP, tasks can be categorized into two major groups: “intermediate tasks” and “final tasks.”

  • Intermediate tasks, such as word segmentation, part-of-speech tagging, and syntactic analysis, don’t directly address real-world needs but rather serve as preparatory stages for solving actual tasks. For example, the user doesn’t require a syntactic analysis tree; they just want an accurate translation.
  • In contrast, “final tasks,” like text classification and machine translation, directly fulfil user needs.

Intermediate tasks initially arose due to the limited capabilities of early NLP technology. Researchers segmented complex problems like Machine Translation into simpler intermediate stages because tackling them all at once was challenging. However, the emergence of Bert/GPT has made many of these intermediate tasks obsolete. These models, through extensive pre-training on data, have incorporated these intermediate tasks as linguistic features within their parameters. As a result, we can now address final tasks directly, without modelling these intermediary processes.

Even Chinese word segmentation, a potentially controversial example, follows the same principle. We no longer need to determine which words should constitute a phrase; instead, we let Large Language Models (LLM) learn this as a feature. As long as it contributes to task-solving, LLM will naturally grasp it. This may not align with conventional human word segmentation rules.

In light of these developments, it’s evident that with the advent of Bert/GPT, NLP intermediate tasks are gradually becoming obsolete.

Impact 2: Standardization of Technical Approaches Across All Areas

Within the realm of “final tasks,” there are essentially two categories: natural language understanding tasks and natural language generation tasks.

  • Natural language understanding tasks, such as text classification and sentiment analysis, involve categorizing input text.
  • In contrast, natural language generation tasks encompass areas like chatbots, machine translation, and text summarization, where the model generates output text based on input.

Since the introduction of the Bert/GPT models, a clear trend towards technical standardization has emerged.

Firstly, feature extractors across various NLP subfields have shifted from LSTM/CNN to Transformer. The writing was on the wall shortly after Bert’s debut, and this transition became an inevitable trend.

Currently, Transformer not only unifies NLP but is also gradually supplanting other models like CNN in various image processing tasks. Multi-modal models have similarly adopted the Transformer framework. This Transformer journey, starting in NLP, is expanding into various AI domains, kickstarted by the Vision Transformer (ViT) in late 2020. This expansion shows no signs of slowing down and is likely to accelerate further.

Secondly, most NLP subfields have adopted a two-stage model: model pre-training followed by application fine-tuning or Zero/Few Shot Prompt application.

To be more specific, various NLP tasks have converged into two pre-training model frameworks:

  • For natural language understanding tasks, the “bidirectional language model pre-training + application fine-tuning” model represented by Bert has become the standard.
  • For natural language generation tasks, the “autoregressive language model (i.e., one-way language model from left to right) + Zero/Few Shot Prompt” model represented by GPT 2.0 is now the norm.

Though these models may appear similar, they are rooted in distinct development philosophies, leading to divergent future directions. Regrettably, many of us initially underestimated the potential of GPT’s development route, instead placing more focus on Bert’s model.
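To make the two modes concrete, here is a rough sketch using the Hugging Face transformers library. The model names, the toy sentiment task, and the prompt wording are my own illustrative choices, not the exact setups discussed in this post.

import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          AutoModelForCausalLM)

# Mode 1 (Bert-style): bidirectional pre-training + a task-specific fine-tuning head.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
clf = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
batch = tok(["great movie", "terrible movie"], return_tensors="pt", padding=True)
loss = clf(**batch, labels=torch.tensor([1, 0])).loss   # back-propagate this loss to fine-tune

# Mode 2 (GPT-style): autoregressive pre-training + few-shot prompting, no weight updates.
gpt_tok = AutoTokenizer.from_pretrained("gpt2")
gpt = AutoModelForCausalLM.from_pretrained("gpt2")
prompt = "Review: great movie\nSentiment: positive\nReview: terrible movie\nSentiment:"
output = gpt.generate(**gpt_tok(prompt, return_tensors="pt"), max_new_tokens=2)
print(gpt_tok.decode(output[0]))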

Paradigm Shift 2.0 (2020): Moving from Pre-Trained Models to Artificial General Intelligence (AGI)

This paradigm shift began around the time GPT 3.0 emerged, approximately in June 2020, and we are currently undergoing this transition.

ChatGPT served as a pivotal point in initiating this paradigm shift. However, before the appearance of InstructGPT, Large Language Models (LLM) were in a transitional phase.

Transition Period: Dominance of the “Autoregressive Language Model + Prompting” Model as Seen in GPT 3.0

As mentioned earlier, during the early stages of pre-training model development, the technical landscape primarily converged into two distinct paradigms: the Bert mode and the GPT mode. Bert was the favoured path, with several technical improvements aligning with that direction. However, as technology progressed, we observed that the largest LLMs currently in use are predominantly based on the "autoregressive language model + Prompting" mode, similar to GPT 3.0. Models like GPT-3, PaLM, GLaM, Gopher, Chinchilla, MT-NLG, LaMDA, and more all adhere to this mode, without exception.

Why has this become the prevailing trend? There are likely two key reasons driving this shift, and I believe they are at the forefront of this transition.

Firstly, Google's T5 model played a crucial role in formally unifying the external form of natural language understanding and natural language generation tasks. In the T5 model, tasks that involve natural language understanding, like text classification and determining sentence similarity, share the same input-output format as generation tasks.

This means that classification tasks can be transformed within the LLM model to generate corresponding category strings, achieving a seamless integration of understanding and generation tasks. This compatibility allows natural language generation tasks to harmonize with natural language understanding tasks, a feat that would be more challenging to accomplish the other way around.

The second reason is that if you aim to excel at zero-shot prompting or few-shot prompting, the GPT mode is essential.

Now, recent studies, as referenced in “On the Role of Bidirectionality in Language Model Pre-Training,” demonstrate that when downstream tasks are resolved during fine-tuning, the Bert mode outperforms the GPT mode. Conversely, if you employ zero-shot or few-shot prompting to tackle downstream tasks, the GPT mode surpasses the Bert mode.

But this leads to an important question: Why do we strive to use zero-shot or few-shot prompting for task completion? To answer this question, we first need to address another: What type of Large Language Model (LLM) is the most ideal for our needs?

The Ideal Large Language Model (LLM)

An ideal Large Language Model (LLM) has the following characteristics. Firstly, the LLM should possess robust self-learning capabilities. When fed with various types of data such as text and images from the world, it should autonomously acquire the knowledge contained within. This learning process should require no human intervention, and the LLM should be adept at flexibly applying this knowledge to address real-world challenges. Given the vastness of the data, this model will naturally be substantial in size: a true giant model.

Secondly, the LLM should be capable of tackling problems across any subfield of Natural Language Processing (NLP) and extend its capabilities to domains beyond NLP. Ideally, it should proficiently address queries from any discipline.

Moreover, when we utilize the LLM to resolve issues in a particular field, the LLM should understand human commands and use expressions that align with human conventions. In essence, it should adapt to humans, rather than requiring humans to adapt to the LLM model.

A common example of people adapting to the LLM is the need to brainstorm and experiment with various prompts in order to find the best one for a specific problem. The interface layer between humans and the LLM is therefore a key part of the ideal design: it should let users interact with the model effectively without this kind of trial and error.

Now, let’s revisit the question: Why should we pursue zero-shot/few-shot prompting to complete tasks? There are two key reasons:

  1. The Enormous Scale of LLM Models: Building and modifying LLM models of this scale requires immense resources and expertise, and very few institutions can undertake this. However, there are numerous small and medium-sized organizations and even individuals who require the services of such models. Even if these models are open-sourced, many lack the means to deploy and fine-tune them. Therefore, an approach that allows task requesters to complete tasks without tweaking the model parameters is essential. In this context, prompt-based methods offer a solution to fulfil tasks without relying on fine-tuning (note that soft prompting deviates from this trend). LLM model creators aim to make LLM a public utility, operating it as a service. To accommodate the evolving needs of users, model producers must strive to enable LLM to perform a wide range of tasks. This objective is a byproduct and a practical reason why large models inevitably move toward achieving General Artificial Intelligence (AGI).
  2. The Evolution of Prompting Methods: Whether it’s zero-shot prompting, few-shot prompting, or the more advanced Chain of Thought (CoT) prompting that enhances LLM’s reasoning abilities, these methods align with the technology found in the interface layer illustrated earlier. The original aim of zero-shot prompting was to create the ideal interface between humans and LLM, using the task expressions that humans are familiar with. However, it was found that LLM struggled to understand and perform well with this approach. Subsequent research revealed that when a few examples were provided to represent the task description, LLM’s performance improved, leading to the exploration of better few-shot prompting technologies. In essence, our initial hope was for LLM to understand and execute tasks using natural, human-friendly commands. However, given the current technological limitations, these alternative methods have been adopted to express human task requirements.

Understanding this logic, it becomes evident that few-shot prompting, also known as In Context Learning, is a transitional technology. When we can describe a task more naturally and LLM can comprehend it, we will undoubtedly abandon these transitional methods. The reason is clear: using these approaches to articulate task requirements does not align with human habits and usage patterns.
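To make the distinction concrete, here is a minimal sketch of how zero-shot, few-shot (In Context Learning), and Chain of Thought prompts differ. The tasks and wording are purely illustrative.

# Zero-shot: state the task directly in natural language, with no examples.
zero_shot = "Translate the following sentence from Chinese to English: 今天会下雨吗？"

# Few-shot (In Context Learning): prepend a few worked examples as the task description.
few_shot = (
    "Chinese: 你好 -> English: Hello\n"
    "Chinese: 谢谢 -> English: Thank you\n"
    "Chinese: 今天会下雨吗？ -> English:"
)

# Chain of Thought: ask the model to reason step by step before giving the final answer.
cot = ("Q: If I have 3 apples and buy 2 more, how many do I have? "
       "Let's think step by step.")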

This is also why I classify GPT 3.0+Prompting as a transitional technology. The arrival of ChatGPT has disrupted this existing state of affairs by introducing Instruct instead of Prompting. This change marks a new technological paradigm shift and has subsequently led to several significant consequences.

Impact 1: LLMs Adapt to Human Needs with Natural Interfaces

In the context of an ideal LLM, let’s focus on ChatGPT to grasp its technical significance. ChatGPT stands out as one of the technologies that align most closely with the ideal LLM, characterized by its remarkable attributes: “Powerful and considerate.”

This "powerful capability" can be primarily attributed to the foundation provided by the underlying LLM, GPT 3.5, on which ChatGPT relies. While ChatGPT includes some manually annotated data, its scale is relatively small, amounting to tens of thousands of examples. In contrast, GPT 3.5 was trained on hundreds of billions of tokens, making this additional data negligible in terms of its contribution to the vast wealth of world knowledge and common sense already embedded in GPT 3.5. Hence, ChatGPT's power primarily derives from the GPT 3.5 model, which sets the benchmark for the ideal LLM.

But does ChatGPT infuse new knowledge into the GPT 3.5 model? Yes, it does, but this knowledge isn’t about facts or world knowledge; it’s about human preferences. “Human preference” encompasses a few key aspects:

  • First and foremost, it involves how humans naturally express tasks. For instance, humans typically say, “Translate the following sentence from Chinese to English” to convey the need for “machine translation.” But LLMs aren’t humans, so understanding such commands is a challenge. To bridge this gap, ChatGPT introduces this knowledge into GPT 3.5 through manual data annotation, making it easier for the LLM to comprehend human commands. This is what empowers ChatGPT with “empathy.”
  • Secondly, humans have their own standards for what constitutes a good or bad answer. For example, a detailed response is deemed good, while an answer containing discriminatory content is considered bad. The feedback that people provide to the LLM through the Reward Model embodies this quality preference (a minimal sketch of this preference signal follows the list). In essence, ChatGPT imparts human-preference knowledge to GPT 3.5, resulting in an LLM that comprehends human language and is more polite.
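For intuition, here is a minimal sketch of the pairwise loss commonly used to train such a reward model; the two scores are made-up numbers standing in for the reward model's outputs on a human-preferred answer and a rejected one.

import torch
import torch.nn.functional as F

# Hypothetical reward-model scores for one (chosen, rejected) answer pair.
r_chosen = torch.tensor([1.3])
r_rejected = torch.tensor([0.2])

# Pairwise preference loss: push the reward of the human-preferred answer above the rejected one.
loss = -F.logsigmoid(r_chosen - r_rejected).mean()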

The most significant contribution of ChatGPT is its achievement of the interface layer of the ideal LLM. It allows the LLM to adapt to how people naturally express commands, rather than requiring people to adapt to the LLM’s capabilities and devise intricate command interfaces. This shift enhances the usability and user experience of LLM.

It was InstructGPT/ChatGPT that initially recognized this challenge and offered a viable solution. This is also their most noteworthy technical contribution. In comparison to prior few-shot prompting methods, it is a human-computer interface technology that aligns better with human communication habits for interacting with LLM.

This achievement is expected to inspire subsequent LLM models and encourage further efforts in creating user-friendly human-computer interfaces, ultimately making LLM more responsive to human needs.

Impact 2: Many NLP subfields no longer have independent research value

In the realm of NLP, this paradigm shift signifies that many independently existing NLP research fields will be incorporated into the LLM technology framework, gradually losing their independent status and fading away. Following the initial paradigm shift, while numerous “intermediate tasks” in NLP are no longer required as independent research areas, most of the “final tasks” remain and have transitioned to a “pre-training + fine-tuning” framework, sparking various improvement initiatives to tackle specific domain challenges.

Current research demonstrates that for many NLP tasks, as the scale of LLM models increases, their performance significantly improves. From this, one can infer that many of the so-called “unique” challenges in a given field likely stem from a lack of domain knowledge. With sufficient domain knowledge, these seemingly field-specific issues can be effectively resolved. Thus, there’s often no need to focus intensely on field-specific problems and devise specialized solutions. The path to achieving AGI might be surprisingly straightforward: provide more data in a given field to the LLM and let it autonomously accumulate knowledge.

In this context, ChatGPT proves that we can now directly pursue the ideal LLM model. Therefore, the future technological trend should involve the pursuit of ever-larger LLM models by expanding the diversity of pre-training data, allowing LLMs to independently acquire domain-specific knowledge through pre-training. As the model scale continues to grow, numerous problems will be addressed, and the research focus will shift to constructing this ideal LLM model rather than solving field-specific problems. Consequently, more NLP subfields will be integrated into the LLM technology system and gradually phase out.

In my view, the criteria for determining whether independent research in a specific field should cease can be one of the following two methods:

  • First, assess whether the LLM’s research performance surpasses human performance for a particular task. For fields where LLM outperforms humans, there is no need for independent research. For instance, for many tasks within the GLUE and SuperGLUE test sets, LLMs currently outperform humans, rendering independently existing research fields closely associated with these datasets unnecessary.
  • Second, compare task performance between the two modes. The first mode involves fine-tuning with extensive domain-specific data, while the second mode employs few-shot prompting or instruct-based techniques. If the second mode matches or surpasses the performance of the first, it indicates that the field no longer needs to exist independently. By this standard, many research fields currently favour fine-tuning (due to the abundance of training data), seemingly justifying their independent existence. However, as models grow in size, the effectiveness of few-shot prompting continues to rise, and it’s likely that this turning point will be reached in the near future.

If these speculations hold true, it presents the following challenging realities:

  • Many NLP researchers must decide which path to pursue: should they persist in addressing field-specific challenges, or abandon what may seem like a less promising route and instead focus on building a superior LLM?
  • If the choice is to invest in LLM development, which institutions possess the ability and resources to undertake this endeavour?
  • What is your answer to this question?

Impact 3: More research fields beyond NLP will be included in the LLM technology system

From the perspective of AGI, and referring to the ideal LLM model described previously, the tasks it can complete should not be limited to the NLP field or to one or two subject areas. The ideal LLM should be a domain-independent general artificial intelligence model; the fact that it currently does well in one or two fields does not mean it can only do those tasks.

The emergence of ChatGPT proves that it is feasible for us to pursue AGI in this period, and now is the time to put aside the shackles of “field discipline” thinking.

In addition to demonstrating its ability to solve various NLP tasks in a smooth conversational format, ChatGPT also has powerful coding capabilities. Naturally, more and more other research fields will be gradually included in the LLM system and become part of general artificial intelligence.

As LLMs expand from NLP to the outside world, a natural next step is image processing and multi-modal tasks. There are already efforts to integrate multimodality and make the LLM a universal human-computer interface that supports multimodal input and output. Typical examples include DeepMind's Flamingo and Microsoft's "Language Models are General-Purpose Interfaces", which demonstrate the conceptual structure of this approach.

My judgment is that, for both images and multi-modality, the integration into LLMs as genuinely useful functions may be slower than we think.

The main reason is that the image field has spent the past two years imitating Bert's pre-training approach, trying to introduce self-supervised learning to unlock the model's ability to learn knowledge independently from image data. The typical technologies are "contrastive learning" and MAE, which represent two different technical routes.

However, judging from the current results, and despite great technological progress, this road does not yet seem complete. This is reflected in the application of image-domain pre-trained models to downstream tasks, which brings far fewer benefits than applying Bert or GPT to NLP downstream tasks.

Therefore, image pre-training models still need deeper exploration to unleash the potential of image data, which will delay their unification into large LLMs. Of course, if this road is opened one day, the current situation in NLP will very likely repeat itself: the various research subfields of image processing may gradually disappear and be integrated into large-scale LLMs that directly complete terminal tasks.

In addition to images and multi-modality, it is obvious that other fields will gradually be included in the ideal LLM. This direction is in the ascendant and is a high-value research topic.

The above are my personal thoughts on the paradigm shifts. Next, let's sort out the mainstream technological progress of LLM models after GPT 3.0.

As shown in the ideal LLM model, related technologies can actually be divided into two major categories;

  • One category is about how the LLM model absorbs knowledge from data and also includes the impact of model size growth on LLM’s ability to absorb knowledge;
  • The second category concerns the human-computer interface: how people use the inherent capabilities of LLMs to solve tasks, including the In Context Learning and Instruct modes. Chain of Thought (CoT) prompting, an LLM reasoning technique, essentially belongs to In Context Learning. Because they are more important, I will discuss them separately.

Reference

  1. Vaswani, A. et al. Transformer: Attention Is All You Need. https://arxiv.org/pdf/1706.03762.pdf (2017).
  2. Radford, A., Narasimhan, K., Salimans, T. & Sutskever, I. GPT: Improving Language Understanding by Generative Pre-Training. (2018).
  3. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. (2018).
  4. Radford, A. et al. GPT2: Language Models are Unsupervised Multitask Learners. (2019).
  5. Brown, T. B. et al. GPT3: Language Models are Few-Shot Learners. (2020).
  6. Ouyang, L. et al. GPT 3.5: Training language models to follow instructions with human feedback. (2022).
  7. Eloundou, T., Manning, S., Mishkin, P. & Rock, D. GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models. (2023).
  8. OpenAI. Introducing ChatGPT. https://openai.com/blog/chatgpt (2022).
  9. Schulman, J., Wolski, F., Dhariwal, P., Radford, A. & Klimov, O. Proximal Policy Optimization Algorithms. (2017).

Deep Learning Recommender System – Part 1: Technical Framework

[2023-06 Update]: added more details about related terms for better understanding.

In this blog, I will review the Classic Technical Framework of the modern (Deep Learning) recommendation system (aka. recommender system).

Before Starting

Before I start, I want to ask readers a question:

What is the first thing you want to do when you start learning a new field X?

Of course, everyone has their own answers, but for me, there are two questions I want to know the most at the beginning. Here X is the recommender system.

What problem is this X trying to solve?

Is there a high-level mind map so that I can understand the basic concepts, main technologies and development requirements in this X?

Moreover, for the field of “deep learning recommendation system”, there may be a third question.

Why do people keep emphasizing “deep learning“, and what revolutionary impact does deep learning bring to the recommendation system?

I hope you will find answers to these three questions after reading this blog.

What is the fundamental problem to be solved by a recommender system?

The applications of recommender systems have permeated all aspects of life, such as shopping, entertainment, and learning. Although recommendation scenarios such as product recommendation, video recommendation, and news recommendation may look completely different, since they are all called "recommender systems", the essential problem they solve must be the same and follow a common logical framework.

The problem to be solved by the recommender system can be summed up in one sentence:

In the information overload era, how can Users efficiently obtain the Items of their interest?

Therefore, the recommendation system is a bridge built between “the overloaded Internet information” and “users’ interests“.

Let’s look at the recommender system’s abstract logical architecture, and then build its technical architecture step by step, so you can have an overall impression of the entire system.

The logical architecture of the recommender system

Starting from the fundamental problem of the recommendation system, we can clearly see that the recommender system is actually dealing with the relationship between "people" and "information". That is, based on "people" and "information", it constructs a method for finding the information that interests each person.

  • User – From the perspective of “people“, in order to more reliably infer the interests of “people”, the recommender system hopes to use a large amount of information related to “people”, including historical behaviour, population attributes, relationship networks, etc. They may be collectively referred to as “User Information“.
  • Item – The definition of “information” has specific meanings and diverse interpretations in different scenarios. For example, it refers to “product information” in product recommendations, “video information” in video recommendations, and “news information” in news recommendations. We can collectively refer to them as “Item Information” for convenience.
  • Context – In addition, in a specific recommendation scenario, the user's final selection is generally affected by a series of environmental factors such as time, location, and user status, which can be called "scene information" or "context information".

With these definitions, the problem to be dealt with by the recommender system can be formally defined as:

For a certain user U (User), in a specific scenario C (Context), build a function over the massive set of candidate items to predict the user's preference for a specific candidate item I (Item), and then sort all candidate items by that preference to generate a recommendation list.
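Expressed as code, this definition boils down to scoring and sorting. The sketch below is only illustrative: the function and argument names are my own, and score_fn stands for the f(U, I, C) recommendation model discussed in the rest of this post.

def recommend(user, context, candidate_items, score_fn, top_k=10):
    """Score every candidate item for this user and context, then return the top-k list."""
    scored = [(item, score_fn(user, item, context)) for item in candidate_items]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [item for item, _ in scored[:top_k]]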

In this way, we can abstract the logical framework of the recommender system, as shown in Figure 1. Although this logical framework is relatively simple, it is on this simple basis that we can refine and expand each module to produce the entire technical system.

Figure 1. The logical architecture of the recommender system: the candidate items, the recommendation model built on User, Item and Context information, and the final recommendation list.

The Revolution of Deep Learning for Recommender Systems

With the logical architecture of the recommender system (Figure 1), we can answer the third question from the beginning:

What revolutionary impact does deep learning bring to the recommender system?

In the logic architecture diagram, the central position is an abstract function f(U, I, C), which is responsible for “guessing” the user’s heart and “scoring” the items that the user may be interested in so as to obtain the final recommended item list. In the recommender system, this function is generally referred to as the “recommendation system model” (hereinafter referred to as the “recommendation model“).

The application of deep learning to recommendation systems can greatly enhance the fitting and expressive capabilities of recommendation models. Simply put, deep learning aims to make the recommendation model “guess more accurately” and better capture the “heart” of users.

You may still not have a clear picture, so next let's compare the traditional machine learning recommendation model and the deep learning recommendation model from the perspective of model structure, so that you can form a clearer understanding.

Here is a model structure comparison chart in Figure 2, which compares the difference between the traditional Matrix Factorization model and the Deep Learning Matrix Factorization model.

Figure 2. Traditional Matrix Factorization model vs Deep Learning Matrix Factorization model (Neural collaborative filtering). Source: https://arxiv.org/abs/1708.05031

Let’s ignore the details for now. How do you feel at first glance?

Do you feel that the deep learning model has become more complex, layer after layer, and the number of layers has increased a lot?

In addition, the flexible model structure of the deep learning model also gives it an irreplaceable advantage; that is, we can let the neural network of the deep learning model simulate the changing process of many user interests and even the thinking process of the user making a decision.
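To make the structural contrast in Figure 2 concrete, here is a minimal PyTorch sketch of the two ideas. The embedding dimension and MLP sizes are arbitrary, and the actual Neural Collaborative Filtering model in the cited paper also keeps a generalized matrix-factorization branch alongside the MLP.

import torch
import torch.nn as nn

class MatrixFactorization(nn.Module):
    """Traditional MF: the score is the dot product of user and item embeddings."""
    def __init__(self, n_users, n_items, dim=32):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)
        self.item_emb = nn.Embedding(n_items, dim)

    def forward(self, user_ids, item_ids):
        return (self.user_emb(user_ids) * self.item_emb(item_ids)).sum(dim=-1)

class NeuralMF(nn.Module):
    """Deep learning variant: replace the dot product with a learned MLP interaction."""
    def __init__(self, n_users, n_items, dim=32):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)
        self.item_emb = nn.Embedding(n_items, dim)
        self.mlp = nn.Sequential(nn.Linear(2 * dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, user_ids, item_ids):
        x = torch.cat([self.user_emb(user_ids), self.item_emb(item_ids)], dim=-1)
        return self.mlp(x).squeeze(-1)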

For example, Alibaba's deep learning model, the Deep Interest Evolution Network (DIEN, Figure 3), uses a three-layer sequence-model structure to simulate how users' interests evolve when purchasing goods. Such a powerful ability to fit the data and interpret user behaviour is not available in traditional machine learning models.

Figure 3. Alibaba’s Deep Interest Evolution Network (DIEN) for Click-Through Rate (CTR) Prediction. AUGRU (GRU with attentional update gate) models the interest-evolving process that is relative to the target item. GRU denotes Gated Recurrent Units, which overcomes the vanishing gradients problem of RNN and is faster than LSTM. Source: https://arxiv.org/abs/1809.03672.

Moreover, the revolutionary impact of deep learning on recommender systems goes far beyond that. In recent years, due to the greatly increased structural complexity of deep learning models, more and more data streams are required to converge for the model training, testing and serving. The data storage, processing and updating modules related to the recommender systems on the cloud computing platforms have also entered the “deep learning era”.

After talking so much about the impact of deep learning on recommender systems, we seem to have not seen a complete deep learning recommender system architecture. Don’t worry. Let’s talk about what the technical architecture of a classic deep-learning recommender system looks like in the next section.

Technical Architecture of Deep Learning Recommendation System

The architecture of the deep learning recommender system follows the same lines as the classic one; it improves specific modules of the classic architecture to enable and support the application of deep learning. So I will first describe the classic recommender system architecture and then discuss deep learning and its improvements.

In the actual recommender system, there are two types of problems that engineers need to focus on in projects.

Type 1 is about data and information. What are “user information”, “item information”, and “scenario information”? How to store, update and process these data?

Type 2 is on recommendation algorithms and models. How to train, predict, and achieve better recommendation results for the system?

An industrial recommendation system’s technical architecture is based on these two parts.

  • The “data and information” part has gradually developed into a data flow framework that integrates offline (nearline) batch processing of data and real-time stream processing in the recommender system.
  • The “model and algorithm” part is further refined into a model framework that integrates training, evaluation, deployment, and online inference in the recommender system.

Based on this, we can summarize the technical architecture diagram of the recommender system as in Figure 4.

Figure 4. Diagram of the technical architecture of the recommendation system

In Figure 4, I divided the technical architecture of the recommender system into a “data part” and a “model part”.

Part 1: Data Framework

The "data part" of the recommender system is mainly responsible for collecting and processing "user", "item" and "context" information. Based on the amount of data and the real-time processing requirements, we use three different data processing methods, listed here in descending order of real-time performance:

  1. Client and Server end-to-end real-time data processing.
  2. Real-time stream data processing.
  3. Big data offline processing.

From 1 to 3, the real-time performance of these methods decreases from strong to weak, and the massive data processing capabilities of the three methods increase from weak to strong.

| Method | Real-Time Performance | Data Processing Capability | Possible Solutions |
| --- | --- | --- | --- |
| Client and server end-to-end real-time data processing | Strong | Weak | – |
| Real-time stream data processing | Medium | Medium | Flink |
| Big data offline processing | Weak | Strong | Spark |

Therefore, the data flow system of a mature recommender system uses all three methods together so that they complement each other.

The big data computing platform (e.g., AWS, Azure, GCP, etc.) of the recommender system can extract training data, feature data, and statistical data through the processing of the system logs and metadata of items and users. So what are these data for?

Specifically, there are three downstream applications based on the data exported from the data platform:

  1. Generate the Sample Data required by the recommender system model for the training and evaluation of the algorithm model.
  2. Generate the "user features", "item features" and a part of the "context features" required by the recommendation model service (Model Serving) for online inference.
  3. Generate statistical data required for System Monitoring and Business Intelligence (BI) systems.

The data part is the “water source” of the entire recommender system. Only by ensuring the continuity and purity of the “water source” can we continuously “nourish” the recommender system so that it can operate efficiently and output accurately.

In the deep learning era, models have higher requirements for the "water source". First, the amount of water must be large; only then can we ensure that the deep learning models we build converge as soon as possible. Second, the "water flow" must be fast, so that data reaches the system modules for model updates and adjustments in time and the model can capture changes in user interest in real time. This is also the reason behind the rapid development and adoption of big data engines (e.g., Spark) and stream computing platforms (e.g., Flink).

Part 2: Model Framework

The “model part” is the major body of the recommender system. The structure of the model is generally composed of a “recall layer“, a “ranking layer”, and a “supplementary (auxiliary) strategy and algorithm layer“.

  • The “recall layer” is generally composed of efficient recall rules, algorithms or simple models, which allow the recommender system to quickly recall items that users may be interested in from a massive candidate set.
  • The “ranking layer” uses the ranking/sorting model(s) to fine-sort-rank the candidate items that are initially screened by the recall layer.
  • The "supplementary strategy and algorithm layer", also known as the "re-ranking layer", applies supplementary strategies and algorithms that account for indicators such as "diversity", "popularity" and "freshness" before the list is returned to the user. These adjustments to the item list produce the final user-visible recommendation list.

The “model serving process” means the recommender system model receives a full candidate item set and then generates the recommendation list.
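Putting the three layers together, the serving process can be sketched as the short pipeline below. The recall, rank and rerank functions are hypothetical placeholders standing in for the rules and models described above, not a real implementation.

def recall(user, full_candidate_set, k=500):
    # Placeholder: in practice, embedding similarity or co-occurrence rules shrink the candidate set.
    return full_candidate_set[:k]

def rank(user, context, candidates):
    # Placeholder: in practice, a deep CTR model scores each recalled candidate.
    return sorted(candidates, key=lambda item: hash((user, item)) % 1000, reverse=True)

def rerank(ranked, top_k=10):
    # Placeholder: in practice, adjust the ordered list for diversity, popularity and freshness.
    return ranked[:top_k]

def serve_recommendations(user, context, full_candidate_set):
    candidates = recall(user, full_candidate_set)   # recall layer
    ordered = rank(user, context, candidates)       # ranking layer
    return rerank(ordered)                          # re-ranking layer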

In order to optimise the model parameters required by the model service process, we need to determine the model structure, the specific values of the different parameter weights in the structure, and the parameter values in the model-related algorithms and strategies through model training.

The training methods can be divided into “offline training” and “online updating” according to different environments.

  • The advantage of offline training is that the optimizer can use the full set of samples and all the features, pushing the model toward globally optimal performance.
  • While online updating can “digest” new data samples in real-time, learn and reflect new data trends more quickly, and meet the real-time recommending requirements.

In addition, in order to evaluate the performance of the recommender system model and optimize the model iteratively, the model part of the recommender system also includes various evaluation modules such as “offline evaluation” and “online A/B test“, which are used to obtain offline and online indicators to guide the model iteration and optimization.

We just said that the revolution of deep learning for recommender systems is in the model part, so what are the specifics?

I summarized the most typical deep learning applications into 3 points:

  1. The application of embedding technology in the recall layer. Deep learning embedding technology is already the mainstream industry solution for letting the recall layer quickly generate user-relevant items (a minimal sketch follows this list).
  2. The application of deep learning models with different structures in the ranking layer. The ranking layer (also known as the fine sorting layer) is the most important factor affecting the system's performance, and it is also the area where deep learning models show their strengths. The deep learning model has high flexibility and strong expressive ability, making it suitable for accurate sorting under large data volumes. Undoubtedly, the deep learning ranking model is a hot topic in both industry and academia, and it will keep attracting investment and being rapidly iterated by researchers and engineers.
  3. The application of reinforcement learning to model updating and integration (CI/CD). Reinforcement learning is another field of machine learning closely related to deep learning, and its application in recommender systems enables a higher level of real-time performance.
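As a rough illustration of embedding-based recall, the sketch below scores all candidate items against a user embedding with a brute-force dot product and keeps the top k. The tensors are random placeholders, and a production system would typically use an approximate nearest-neighbour index (e.g., Faiss) rather than brute force.

import torch

user_vec = torch.randn(64)               # embedding of the current user (placeholder)
item_matrix = torch.randn(100_000, 64)   # embeddings of all candidate items (placeholder)

scores = item_matrix @ user_vec                    # similarity of every item to this user
top_scores, top_ids = torch.topk(scores, k=500)    # item ids handed over to the ranking layer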

Summary

In this blog, I reviewed the technical architecture of the deep learning recommendation system. Although it involves a lot of content, you don’t have to worry about it if you cannot remember all the details. All you need is to keep the impression of this framework in your mind.

You can use the content of this framework as a technical index of a recommender system, making it your own knowledge map. Visually speaking, you can think of the content of this blog as a tree of knowledge, which has roots, stems, branches, leaves, and flowers.

Let’s recall the most important concepts again:

  • The root is that the recommender system aims to solve the challenge of how to help users efficiently obtain the items of interest in this “information overload” era.
  • The stems of the recommender system are its logical architecture: for a user U (User), in a specific scenario C (Context), a function is constructed over a large number of "items" (products, videos, etc.) to predict the user's degree of preference for a specific candidate item I (Item).
  • The branches and leaves are the technical modules of the recommender system and the algorithms/models of each module, respectively. The technical module is responsible for supporting the system’s technical architecture. Using algorithms and models, we can create diverse functions within the system and provide accurate results.

Finally, the application of deep learning is undoubtedly the pearl of the current technical architecture of recommender systems. It is like a flower blooming on this big tree, and it is the most wonderful finishing touch.

The structure of the deep learning model is complex, but its data fitting and expressive abilities are stronger, so the recommendation model can better simulate the user's interest-changing process and even the decision-making process. The development of deep learning has also driven a revolution in the data flow framework, placing higher demands on cloud computing service providers to process data faster and at greater scale.

I hope you enjoyed this blog. To be continued: the next blog will be Deep Learning Recommender System – Part 2: Feature Engineering.

One more thing…

Figure 5 shows the recommender system of Netflix. Here is the challenge: using the technical framework discussed in this blog, can you tell which parts of the diagram belong to the data part and which belong to the model part?

Figure 5. Diagram of Netflix recommender system architecture

Deep ConvNets for Oracle Bone Script Recognition with PyTorch and Qt-GUI

Shang Dynasty Oracle Bone Scripts Images

1. Project Background

This blog demonstrates how to use PyTorch to build deep convolutional neural networks and how to use Qt to create a GUI around the pre-trained model. The final app runs like the figure below.

Qt GUI with pre-trained Deep ConvNets model for Oracle Bone Scripts Recognition: Model predicts the Oracle Bone Script ‘合’ with 99.8% Acc.

The five original oracle bone scripts in this sample image can be translated into modern Chinese characters as “贞,今日,其雨?” (Test: Today, will it rain?)

Please note that I am not an expert in the ancient Chinese language, so the translation may not be entirely accurate. In the GUI, the user can draw the script in the input panel and then click the run button to get the top 10 Chinese characters ranked by probability. The highest-ranked result is presented with a green background and a probability of 99.8%.

I will assume that readers have a basic understanding of the deep learning model, middle-level skills of python programming, and know a little about UX/UI design with Qt. There are awesome free tutorials on the internet or one could spend a few dollars to join online courses. I see no hurdles for people mastering these skills.

The rest of this blog is arranged as follows: it first explains the basic requirements for this project and then covers all the main steps in detail:

  1. Init the project
  2. Create the Python Environment and Install the Dependencies
  3. Download the Raw Data and Preprocess the Data
  4. Build the Model with Pytorch
    • Review the Image
    • Test the Dataloader
    • Build the Deep ConvNets Model
    • Test the Model with Sample Images
  5. Test the Model with Qt-GUI

The source code can be found on my GitHub Repo: Oracle-Bone-Script-Recognition: Step by Step Demo; the README file contains all the basic steps to run on your local machine.

2. Basic Requirements

I used the cookiecutter package to generate a skeleton for the project.

There are some opinions implicit in the project structure that have grown out of experience with what works and what doesn't when collaborating on data science projects. Some of the opinions are about workflows, and some are about tools that make life easier.

  • Data is immutable
  • Notebooks are for exploration and communication (not for production)
  • Analysis is a DAG (I used the ‘Makefile’ to create command modules of the workflow)
  • Build from the environment up

Starting Requirements

  • conda 4.12.0
  • Python 3.7 or 3.8. I would suggest using Anaconda for the installation of Python, or you can just install the miniconda package, which saves a lot of space on your hard drive.

3. Tutorial Step by Step

Step 1: Init the project

Use the 'git' command to clone the project from GitHub.

cd PROJECT_DIR
git clone https://github.com/cuicaihao/deep-learning-for-oracle-bone-script-recognition 
# or
# gh repo clone cuicaihao/deep-learning-for-oracle-bone-script-recognition

Check the project structure.

cd deep-learning-for-oracle-bone-script-recognition
ls -l
# or 
# tree -h

You will see a structure similar to the one shown at the end of this post. Meanwhile, you can open the 'Makefile' to see the raw commands of the workflow.

Step 2: Create the Python Environment and Install the Dependencies

The default setting is to create a virtual environment with Python 3.8.

make create_environment

Then, we activate the virtual environment.

conda activate oracle-bone-script-recognition

Then, we install the dependencies.

make requirements

The details of the dependencies are listed in the ‘requirements.txt’ file.

Step 3: Download the Raw Data and Preprocess the Data

The first challenge is to find a dataset of oracle bone scripts. I found the 甲骨文 (oracle bone script) website and its GitHub repo, which provide all the script images and the image-to-label database I need. The image folder contains 1602 images, and the image-name-to-Chinese-character (key-value) pairs are stored in JSON, SQL and DB formats, making it a perfect dataset for this project's startup.

The Oracle Bone Script website (甲骨文)

We can download the raw images and the oracle bone script database, then preprocess the data in the project's data/raw directory.

make download_data

The basic steps are to download the repository, unzip it, and then copy the images and the database (JSON) file to the project's data/raw directory.

Then, we preprocess the data to create a table (CSV file) for model development.

make create_dataset

The source code is located at src/data/make_dataset.py. The make command will provide the input arguments to this script to create two tables (CSV file) in the project data/processed directory.

The Raw Data of the Oracle Bone Scripts with the Image-Name Pairs.

Step 4: Build the Model with Pytorch

This section is about model development.

4.1 Review Image and DataLoader

Before building the model, we need to review the images and the data loader.

make image_review

This step will generate a series of images of the oracle bone script image sample to highlight the features of the images, such as colour, height, and width.

Besides, we show the results of different binarization methods of the original greyscale image with the tool provided by the scikit-image package.

The source code is located at src/visualization/visualize.py.

4.2 Test the DataLoader

We can then test the DataLoader with the following command.

make test_dataloader

This will generate an 8×8 grid image of the oracle bone script image sample. The source code is located at src/data/make_dataloader.py.

As shown in the image below, it generates a batch of 64 images with their labels (Chinese characters) in the top-left corner.

A Batch of 8×8 Grid Images Prepared for Deep ConvNets Model
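For readers who want a feel for what sits behind make test_dataloader, here is a minimal sketch of a PyTorch Dataset built from the processed CSV table. The column names and the transform pipeline are my assumptions; the actual implementation lives in src/data/make_dataloader.py.

import pandas as pd
from PIL import Image
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms

class OracleBoneDataset(Dataset):
    """Sketch of a Dataset over the processed CSV (assumed columns: image_path, label_id)."""
    def __init__(self, csv_file, transform=None):
        self.table = pd.read_csv(csv_file)
        self.transform = transform or transforms.Compose(
            [transforms.Grayscale(), transforms.Resize((64, 64)), transforms.ToTensor()]
        )

    def __len__(self):
        return len(self.table)

    def __getitem__(self, idx):
        row = self.table.iloc[idx]
        return self.transform(Image.open(row["image_path"])), int(row["label_id"])

# loader = DataLoader(OracleBoneDataset("data/processed/train.csv"), batch_size=64, shuffle=True)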

4.3 Build the Deep Convolutional Neural Networks Model

Now we can build and train the model. The source code is located at src/models/train_model.py, and the following command will generate the model and the training-process records under models/.

make train_model

(Optional) One can monitor the process by using the tensorboard command.

# Open another terminal
tensorboard --logdir=models/runs

Then open the link: http://localhost:6006/ to monitor the training and validation losses, see the training batch images, and see the model graph.

After the training process, there is one model file named model_best in the models/ directory.
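For orientation, here is a minimal sketch of the kind of training loop behind make train_model, logging to the same models/runs directory used by TensorBoard above. The placeholder network, random tensors and hyper-parameters are my assumptions; the real code in src/models/train_model.py is more complete.

import torch
from torch import nn, optim
from torch.utils.data import TensorDataset, DataLoader
from torch.utils.tensorboard import SummaryWriter

# Placeholders standing in for the real dataset and ConvNet (1602 character classes assumed).
images = torch.randn(256, 1, 64, 64)
labels = torch.randint(0, 1602, (256,))
train_loader = DataLoader(TensorDataset(images, labels), batch_size=64, shuffle=True)
model = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 1602))

writer = SummaryWriter(log_dir="models/runs")   # matches the --logdir used above
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(3):
    for step, (x, y) in enumerate(train_loader):
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        writer.add_scalar("Loss/train", loss.item(), epoch * len(train_loader) + step)

torch.save(model.state_dict(), "models/model_best")   # the repo's checkpoint format may differ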

4.4 Test the Model with Sample Image

The pre-trained model is located at models/model_best. We can test the model with a sample image. I used the image 3653610.jpg from the oracle bone script dataset in the Makefile's test_model script; readers can change it to other images.

make test_model
# ...
# Chinese Character Label = 安
#      label name  count       prob
# 151    151    安      3 1.00000000
# 306    306    富      2 0.01444918
# 357    357    因      2 0.00002721
# 380    380    家      2 0.00001558
# 43      43    宜      5 0.00001120
# 586    586    会      1 0.00000136
# 311    311    膏      2 0.00000134
# 5        5    执      9 0.00000031
# 354    354    鲧      2 0.00000026
# 706    706    室      1 0.00000011

The command will generate a sample figure with a predicted label on the top and a table with the top 10 predicted labels sorted by the probability.

Model Prediction Label and Input Image

Step 5: Test the Model with Qt-GUI

Now that we have the model, we can test it with the Qt GUI. I used Qt Designer to create the UI file at src/ui/obs_gui.ui, then used the pyside6-uic command to generate the Python code from the UI file: `pyside6-uic src/ui/obs_gui.ui -o src/ui/obs_gui.py`.

Launch the GUI with:

python gui.py
# or 
# make test_gui
Draw the script of the ‘和’ and Run the Prediction
Website of the Oracle Bone Script (Index H)

The GUI contains an input drawing window where the user can sketch the oracle bone script as an image. After the user finishes the drawing and clicks the RUN button, the input image is converted to a tensor (np.array) and fed into the model. The model then predicts the label of the input image along with its probability, which is shown in the Control Panel at the top of the GUI (a minimal sketch of this prediction step follows the list below).

  • Text Label 1: Show the Chinese character label of the input image and the prediction probability. If the probability is > 0.5, the label background colour is green; if it is < 0.0001, the background is red; otherwise, it is yellow.
  • Text Label 2: Show the top 10 predicted labels sorted by probability.
  • Clean Button: Clean the input image.
  • Run Button: Run the model with the input image.
  • Translate Button: (Optional) Translate the Chinese character label to English. I did not find a good translation service for a single character, so I left this part for future development or for the readers to think about.
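Under stated assumptions (a loaded model, a class-name list, and the 2-D NumPy array captured from the input panel), the prediction step can be sketched roughly as follows; the preprocessing in the real GUI code may differ.

import numpy as np
import torch
from torch import nn

model = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 1602))   # placeholder; load models/model_best in practice
model.eval()
class_names = [f"class_{i}" for i in range(1602)]                # placeholder for the real character list
drawing = np.zeros((64, 64), dtype=np.float32)                   # placeholder for the drawn input image

x = torch.from_numpy(drawing).unsqueeze(0).unsqueeze(0)          # shape (1, 1, 64, 64)
with torch.no_grad():
    probs = torch.softmax(model(x), dim=1)                       # class probabilities
top_prob, top_idx = probs.topk(10, dim=1)                        # top 10 candidate characters
best_char = class_names[int(top_idx[0, 0])]                      # label coloured green/yellow/red in the GUI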

4. Summary

This repository was inspired by DeepMind's recent work, Predicting the Past with Ithaca. I did not dig into the details of that work due to limited resources.

I think the work is very interesting, and I want to share my experience with the readers by trying a different language like Oracle Bone Scripts. It is also a good starter example for me to revisit the PyTorch deep learning packages and the qt-gui toolboxes.

I will be very grateful if you can share your experience with more readers. If you like this repository, please upvote/star it.

Conclusion

I made a formal statement on my GitHub on the first day of 2022, claiming that I would write 10 blogs on technology, but I got swamped by daily business and other work. Still, to be a man of my word, I made time to serve the community. Here comes the first one.

If you find the repository useful, please consider donating to the Stanford Rural Area Education Program (https://sccei.fsi.stanford.edu/reap/): policy change and research to help China's invisible poor.

Reference

  1. Cookiecutter Data Science
  2. PyTorch Tutorial
  3. Qt for Python
  4. GitHub Chinese-Traditional-Culture/JiaGuWen
  5. Website of the Oracle Bone Script Index

-END-

Aerial Image Segmentation with Deep Learning on PyTorch

Aerial image labelling addresses a core topic in remote sensing: the automatic pixel-wise labelling of aerial imagery. The U-Net has led to more advanced designs in aerial image segmentation; future updates will gradually apply those methods to this repository.

I created the GitHub repo using only one sample (kitsap11.tif) from the public dataset (Inria Aerial Image Labeling) to demonstrate the power of deep learning.

The original sample has been preprocessed to 1000×1000 pixels at a 1.5-meter resolution. The following image shows the model's predictions on the RGB image.

Programming details are kept up to date in the GitHub repo.

Dataset Features

The Inria Aerial Image Labeling dataset addresses a core topic in remote sensing: the automatic pixel-wise labelling of aerial imagery (link to paper). Its main features are:

  • Coverage of 810 km² (405 km² for training and 405 km² for testing).
  • Aerial orthorectified colour imagery with a spatial resolution of 0.3 m.
  • Ground truth data for two semantic classes: building and not building (publicly disclosed only for the training subset).
  • The images cover dissimilar urban settlements, ranging from densely populated areas (e.g., San Francisco's financial district) to alpine towns (e.g., Lienz in Austrian Tyrol).

Instead of splitting adjacent portions of the same images into the training and test subsets, different cities are included in each of the subsets. For example, images over Chicago are included in the training set (and not in the test set), and images over San Francisco are included in the test set (and not in the training set). The ultimate goal of this dataset is to assess the generalization power of the techniques: while Chicago imagery may be used for training, the system should label aerial images over other regions, with varying illumination conditions, urban landscape and time of the year.

The dataset was constructed by combining public domain imagery and public domain official building footprints.

Notification

The full dataset is about 21 GB. In this repo, I selected the following images as an example:

  • RGB: AerialImageDataset/train/images/kitsap11.tif (75MB)
  • GT: AerialImageDataset/train/gt/kitsap11.tif (812KB)

The original *.tif (GeoTIFF) image can be converted to a png image with the following code and the gdal package.
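Since the original snippet is not reproduced here, below is a minimal sketch of such a conversion using the GDAL Python bindings; the file path follows the dataset layout listed above and may need adjusting.

from osgeo import gdal

src = "AerialImageDataset/train/images/kitsap11.tif"   # GeoTIFF input (path assumed)
gdal.Translate("kitsap11.png", src, format="PNG")      # write a PNG copy alongside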

Reference

Github: https://github.com/cuicaihao/aerial-image-segmentation

Deep Learning Specialization on Coursera

Introduction

This repo contains all my work for this specialization. The code and images are taken from the Deep Learning Specialization on Coursera.

In five courses, you are going to learn the foundations of Deep Learning, understand how to build neural networks, and learn how to lead successful machine learning projects. You will learn about convolutional networks, RNNs, LSTMs, Adam, Dropout, BatchNorm, Xavier/He initialization, and more. You will work on case studies from healthcare, autonomous driving, sign language reading, music generation, and natural language processing. You will master not only the theory, but also see how it is applied in industry. You will practice all these ideas in Python and in TensorFlow, which we will teach.
