Technical Review 03: Scale Effects & What happens when LLMs get bigger and bigger

  1. AI Assistant Summary
  2. Introduction
  3. Part One: Pre-training Phase
    1. OpenAI
    2. DeepMind
  4. Part Two: Downstream Tasks
    1. Linearity Tasks
    2. Breakthrough Tasks
    3. U-shaped Tasks
  5. Personal View
  6. References
  7. What’s Next?

AI Assistant Summary

This blog discusses the scale of Large Language Models (LLMs) and their impact on performance. LLMs like GPT, LaMDA, and PaLM have billions of parameters, raising questions about the consequences of their continued growth.

The journey of an LLM involves two stages: pre-training and scenario application. Pre-training focuses on optimizing the model using cross-entropy, while scenario application evaluates the model’s performance in specific use cases. Evaluating an LLM’s quality requires considering both stages, rather than relying solely on pre-training indicators.

Increasing training data, model parameters, and training time has been found to enhance performance in the pre-training stage. OpenAI and DeepMind have explored this issue, with OpenAI finding that a combination of more data and parameters, along with fewer training steps, produces the best results. DeepMind considers the amount of training data and model parameters equally important.

The influence of model size on downstream tasks varies. Linear tasks show consistent improvement as the model scales, while breakthrough tasks only benefit from larger models once they reach a critical scale. Tasks involving logical reasoning demonstrate sudden improvement at specific model scales. Some tasks exhibit U-shaped growth, where performance initially declines but then improves with larger models.

Reducing the LLM’s parameters while increasing training data proportionally can decrease the model’s size without sacrificing performance, leading to faster inference speed.

Understanding the impact of model size on both pre-training and downstream tasks is vital for optimizing LLM performance and exploring the potential of these language models.

Introduction

In recent years, we’ve witnessed a surge in the size of Large Language Models (LLMs), with models now boasting over 100 billion parameters becoming the new standard. Think OpenAI’s GPT-3 (175B), Google’s LaMDA (137B), PaLM (540B), and other global heavyweights. China, too, contributes to this landscape with models like Zhiyuan GLM, Huawei’s “Pangu,” Baidu’s “Wenxin,” etc. But here’s the big question: What unfolds as these LLMs continue to grow?

The journey of pre-trained models involves two crucial stages: pre-training and scenario application.

In the pre-training stage, the optimization objective is the cross-entropy loss. For autoregressive language models such as GPT, this amounts to checking how well the LLM predicts the next word.
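Concretely, for a token sequence w_1, …, w_T, the pre-training objective minimizes the average negative log-likelihood of each next token given its prefix (a standard formulation; the notation here is mine, not quoted from any specific paper):

$$\mathcal{L}(\theta) = -\frac{1}{T}\sum_{t=1}^{T}\log p_\theta\left(w_t \mid w_1,\dots,w_{t-1}\right)$$

A lower loss means the model assigns higher probability to the words that actually come next.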

The real test, though, comes in the scenario application stage, where specific use cases dictate the evaluation criteria. Our intuition says that an LLM with better pre-training indicators will naturally be stronger at downstream tasks. However, this is not entirely true.

Existing research has shown that the optimization metric in the pre-training stage is positively correlated with downstream task performance, but the correlation is not perfect. In other words, looking only at pre-training metrics is not enough to judge whether an LLM is good enough. With that in mind, we will look at these two stages separately to see what happens as the LLM grows.

Part One: Pre-training Phase

First, let’s look at what happens as the model size gradually increases during the pre-training stage. OpenAI studied this question in “Scaling Laws for Neural Language Models” and proposed the “scaling laws” that LLMs follow.

Source: Scaling Laws for Neural Language Models

As shown in the figure above, this study demonstrates that when we independently increase (1) the amount of training data, (2) the number of model parameters, or (3) the training time (for example, from 1 epoch to 2 epochs), the loss of the pre-trained model on the test set decreases monotonically. In other words, the model keeps improving.
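For reference, the paper fits the test loss with power laws in each factor when the other two are not bottlenecks; the exponents below are the paper’s approximate empirical fits, not universal constants:

$$L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N},\ \alpha_N \approx 0.076; \qquad L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D},\ \alpha_D \approx 0.095$$

where N is the number of parameters, D the number of training tokens, and N_c, D_c fitted constants.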

Since all three factors are important when we actually do pre-training, we have a decision-making problem on how to allocate computing power:

Question: given a fixed total compute budget for training an LLM (e.g., a fixed number of GPU hours or GPU days), how should we allocate it?

Should we increase the amount of data and reduce model parameters?

Or should we increase the amount of data and model size at the same time but reduce the number of training steps?

OpenAI

This is a zero-sum game: if one factor is scaled up, the others must be scaled down to keep the total compute unchanged, so there are many possible allocation plans.

In the end, OpenAI chose to increase the amount of training data and the number of model parameters at the same time, while using an early-stopping strategy to reduce the number of training steps. The study showed that increasing only one of the two elements (training data volume or model parameters) is not the best choice; it is better to increase both in a certain proportion, giving priority to model parameters and then to training data.

Concretely, if the total compute budget for training an LLM increases by 10x, then the number of model parameters should be increased by about 5.5x and the amount of training data by about 1.8x to get the best-performing model.
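As a back-of-the-envelope check, this allocation corresponds to growing parameters and data as power laws of compute, roughly N ∝ C^0.73 and D ∝ C^0.27 (approximate exponents from the paper). A minimal sketch:

```python
# Sketch: translating a bigger compute budget into parameter/data growth
# under OpenAI-style scaling exponents (approximate empirical fits).
ALPHA_PARAMS = 0.73  # N ~ C^0.73
ALPHA_DATA = 0.27    # D ~ C^0.27

def optimal_growth(compute_multiplier: float) -> tuple[float, float]:
    """Return (param_multiplier, data_multiplier) for a compute increase."""
    return compute_multiplier ** ALPHA_PARAMS, compute_multiplier ** ALPHA_DATA

params_x, data_x = optimal_growth(10.0)
print(f"params x{params_x:.1f}, data x{data_x:.1f}")
# ~ x5.4, x1.9 -- close to the 5.5x / 1.8x quoted above
```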

DeepMind

A study by DeepMind (Reference: Training Compute-Optimal Large Language Models) explored this issue in more depth.

Source: Training Compute-Optimal Large Language Models

Its basic conclusions are similar to OpenAI’s: training data volume and model parameters should indeed be increased together for better model performance.

Many large models did not take this into account during pre-training: many large LLMs were trained by monotonically increasing the model parameters while keeping the amount of training data fixed. This approach is suboptimal and limits the model’s potential.

However, DeepMind revised OpenAI’s proportions, concluding that the amount of training data and the number of model parameters are equally important.

In other words, assuming that the total computing power budget used to train LLM increases by 10 times, the number of model parameters should be increased by 3.3 times, and the amount of training data should also be increased by 3.3 times to get the best model.

This means that increasing the amount of training data is more important than we previously thought. Based on this understanding, DeepMind chose a different compute allocation when designing the Chinchilla model: compared with the Gopher model (280B parameters trained on 300B tokens), Chinchilla increased the training data by 4x but cut the parameter count to roughly one quarter of Gopher’s, about 70B. Nevertheless, on both pre-training metrics and many downstream task metrics, Chinchilla outperforms the larger Gopher.
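Under DeepMind’s revised exponents (roughly N ∝ C^0.5 and D ∝ C^0.5), a 10x compute budget gives 10^0.5 ≈ 3.2x, which the paper rounds to 3.3x for both factors. A popular rule of thumb distilled from Chinchilla is about 20 training tokens per parameter; a small hedged sketch:

```python
# Sketch of the "Chinchilla-optimal" rule of thumb: ~20 tokens per
# parameter. The 20x ratio is an approximation popularized by the paper,
# not an exact law.
TOKENS_PER_PARAM = 20

def chinchilla_optimal_tokens(n_params: float) -> float:
    """Roughly compute-optimal training tokens for a given model size."""
    return TOKENS_PER_PARAM * n_params

print(f"{chinchilla_optimal_tokens(70e9) / 1e12:.1f}T tokens for a 70B model")
# -> 1.4T tokens, matching the scale of data Chinchilla was trained on
```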

This brings us to the following insight:

We can scale up the training data while shrinking the LLM’s parameters in the same proportion, greatly reducing the model’s size without reducing its performance.

A smaller model has many benefits; for example, inference becomes much faster in deployment. This is undoubtedly a promising development route for LLMs.

Part Two: Downstream Tasks

So far we have looked at the impact of model scale in the pre-training stage. From the perspective of solving specific downstream tasks, different types of tasks behave differently as the model scale increases.

Source: Beyond the Imitation Game Benchmark

Specifically, there are the following three types of tasks.

  • (a) Tasks that achieve the highest linearity scores see model performance improve predictably with scale and typically rely on knowledge and simple textual manipulations.
  • (b) Tasks with high breakthroughs do not see model performance improve until the model reaches a critical scale. These tasks generally require sequential steps or logical reasoning. Around 5% of BIG-bench tasks see models achieve sudden score breakthroughs with increasing scale.
  • (c) Tasks that achieve the lowest (negative) linearity scores see model performance degrade with scale.

Linearity Tasks

The first type of task perfectly reflects the scaling law of the LLM model, which means that as the model scale gradually increases, the performance of the tasks gets better and better, as shown in (a) above.

Such tasks share a common characteristic: they are usually knowledge-intensive. The more knowledge an LLM contains, the better it performs on them.

Many studies have shown that larger LLMs learn more efficiently: given the same amount of training data, a larger model performs better, meaning it extracts more knowledge from the same data than a smaller one does.

Moreover, in practice, the amount of training data usually grows together with the model parameters, so larger models can also learn more knowledge from more data. Together, these findings explain the figure above: as model size increases, performance on knowledge-intensive tasks keeps improving.

Most traditional NLP tasks are knowledge-intensive, and many of them have improved greatly in the past few years, even surpassing human performance. This is most likely driven by the increase in LLM scale rather than by any specific technical improvement.

Breakthrough Tasks

The second type of task demonstrates that LLMs have a kind of “emergent ability”, as shown in (b) above. “Emergent ability” means that when the model’s parameter count is below a certain threshold, the model essentially cannot solve such tasks at all; its performance is equivalent to randomly selecting answers. Once the model scale crosses that threshold, however, the LLM’s performance on such tasks increases suddenly.

In other words, model size is the key that unlocks new LLM capabilities: as the model grows larger, more and more new capabilities are gradually unlocked.

This is a remarkable phenomenon because it suggests an optimistic possibility: many tasks that LLMs cannot solve well today may be solvable in the future simply by making the model larger, since emergent abilities may suddenly unlock those limits one day. The growth of LLMs may keep bringing us unexpected and wonderful gifts.

The article “Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models” points out that tasks exhibiting “emergent ability” share some common features: they generally consist of multiple steps, solving them often requires solving several intermediate steps first, and logical reasoning plays an important role in the final solution.

Chain-of-Thought (CoT) prompting is a typical technique for enhancing LLM reasoning, and it can greatly improve performance on such tasks. I will discuss CoT in a later blog.

Here the most important question is, why does LLM have this “emergent ability” phenomenon? The article “Emergent Abilities of Large Language Models” shares several possible explanations:

Source: Emergent Abilities of Large Language Models

One possible explanation is that the evaluation metrics of some tasks are not smooth enough. For example, some metrics for generation tasks require the model’s output string to match the reference answer exactly to count as correct; otherwise it is scored zero.

Thus, even as the model gradually improves and outputs more correct fragments, any small error still yields a score of zero; only when the model is large enough to get every output segment right does it score at all. In other words, because the metric is not smooth, it fails to reflect that the LLM is actually improving on the task gradually, and this shows up externally as an apparent “emergent ability”.
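A toy illustration of this argument (the outputs below are hypothetical): an all-or-nothing exact-match metric jumps from 0 to 1 at some scale, while a smoother character-level score reveals steady progress.

```python
# Toy example: exact match hides gradual improvement that a smoother
# metric reveals. All model outputs here are hypothetical.
reference = "the answer is 42"
outputs_by_scale = {
    "small":  "the answer is 7",
    "medium": "the answer is 41",
    "large":  "the answer is 42",
}

def exact_match(pred: str, ref: str) -> float:
    return 1.0 if pred == ref else 0.0

def prefix_score(pred: str, ref: str) -> float:
    """Fraction of the reference reproduced before the first wrong character."""
    n = 0
    for p, r in zip(pred, ref):
        if p != r:
            break
        n += 1
    return n / len(ref)

for scale, out in outputs_by_scale.items():
    print(f"{scale}: exact={exact_match(out, reference):.0f} "
          f"smooth={prefix_score(out, reference):.2f}")
# exact match: 0, 0, 1 (looks "emergent"); smooth score: 0.88, 0.94, 1.00
```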

Another possible explanation is that some tasks consist of several intermediate steps. As the model grows, its ability to solve each step gradually improves, but as long as one intermediate step is wrong, the final answer is wrong. This, too, can produce a superficial “emergent ability” phenomenon.
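This compounding effect is easy to check numerically: if a task needs k independent intermediate steps and each step succeeds with probability p, the end-to-end success rate is roughly p^k, which stays near zero until p is high and then climbs steeply (a simplification that ignores dependence between steps):

```python
# End-to-end success of a k-step task when each step succeeds with
# probability p: smooth per-step gains look like a sudden jump overall.
K_STEPS = 10
for p in (0.5, 0.7, 0.9, 0.95, 0.99):
    print(f"per-step {p:.2f} -> end-to-end {p ** K_STEPS:.3f}")
# 0.50 -> 0.001, 0.70 -> 0.028, 0.90 -> 0.349, 0.95 -> 0.599, 0.99 -> 0.904
```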

Of course, these explanations are still conjectures; why LLMs exhibit this phenomenon requires further in-depth research.

U-shaped Tasks

Source: Inverse scaling can become U-shaped

A small number of tasks show a U-shaped curve as model size increases: performance first gets gradually worse as the model grows, then starts improving once the model grows further. The figure above shows this U-shaped trend in the metrics of the pink PaLM model on two tasks.

Why do these tasks appear so special? The article “Inverse Scaling Can Become U-shaped” gives an explanation:

These tasks actually contain two different types of subtasks: the real task and a “distractor task”.

  • When the model is small, it cannot identify either subtask, so its performance is similar to randomly selecting answers.
  • When the model grows to a medium size, it mainly solves the distractor task, which hurts performance on the real task; this appears as a decline in real-task performance.
  • When the model grows further, the LLM can ignore the distractor task and perform the real task, and performance starts to climb again.

For tasks whose performance declines as the model grows, applying Chain-of-Thought (CoT) prompting converts some of them to follow the scaling law (the larger the model, the better the performance) and converts others to a U-shaped growth curve.

This suggests that such tasks are really reasoning tasks, which is why their performance changes qualitatively once CoT is added.

Personal View

Increasing the size of an LLM may not seem technically interesting, but it is actually crucial to building better LLMs. In my opinion, the advances from BERT to GPT-3 and ChatGPT are likely attributable to the growth in model size rather than to any specific technique. Many people would explore the scale ceiling of LLMs if they could.

The key to achieving AGI may lie in large and diverse data, large-scale models, and rigorous training processes. Developing such large LLMs demands strong engineering skills from the technical team, so there is real technical substance involved.

Increasing the scale of the LLM model has research significance. There are two main reasons why it is valuable.

  • Firstly, as the model size grows, the performance of various tasks improves, especially for knowledge-intensive tasks. Additionally, for reasoning and difficult tasks, the effect of adding CoT Prompting follows a scaling law. Therefore, it is important to determine to what extent the scale effect of LLM can solve these tasks.
  • Secondly, the “emergent ability” of LLM suggests that increasing the model size may unlock new capabilities that we did not expect. This raises the question of what these capabilities could be.

Considering these factors, it is necessary to continue increasing the model size to explore the limits of its ability to solve different tasks.

Talk is cheap; in reality, very few AI/ML practitioners have the opportunity to build larger models, given the financial requirements, investment willingness, engineering capability, and technical enthusiasm this demands from research institutions. There are probably no more than ten institutions on Earth that can do it. In the future, however, capable institutions might join forces to build a super-large model:

All (Resources) for One (Model) and One (Model) for All (People).

Modified from Alexandre Dumas, The Three Musketeers

References

  1. OpenAI 2020: Scaling Laws for Neural Language Models (https://arxiv.org/abs/2001.08361)
  2. DeepMind 2022: Training Compute-Optimal Large Language Models (https://arxiv.org/abs/2203.15556)
  3. BIG-bench Project Team 2023: Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models (https://arxiv.org/abs/2206.04615)
  4. Google 2023: Inverse scaling can become U-shaped (https://arxiv.org/abs/2211.02011)

What’s Next?

Technical Review 04: Human-Computer Interface: From In Context Learning to Instruct Understanding (ChatGPT)

Previous Blogs

Google Cloud Professional Machine Learning Engineer Exam Prep Guide and Study Tips

Content created by the author and reviewed by GPT.

Obtaining the Google Cloud Professional Machine Learning Engineer (MLE) certification is a remarkable achievement for anyone pursuing a machine learning career. As someone who recently passed the exam, I’m here to share tips and insights from the journey. Whether you’re considering the exam or currently preparing for it, I hope this guide gives you valuable information based on my experience.

Three Steps in Preparation

Step 1: Read the Exam Guide Thoroughly

Before diving into your exam preparation, start by carefully reading the official Exam Guide provided by Google. This document is the roadmap for us to understand the key topics and expectations for the certification.

It’s essential to have a clear grasp of what the exam covers before we begin our study journey. Revisit the ML basics via Google’s Crash Course to clarify the details.

Step 2: Learn Best Practices for Implementing ML on Google Cloud

Machine learning is a dynamic field with various approaches and techniques. Google provides best practices for implementing ML solutions on their platform, and this practical knowledge is invaluable. Learning these best practices will not only help us in the exam but also equip us with the skills necessary for real-world ML projects.

Official Google documents, which include keywords such as best practice, machine learning solution, and data pipeline, are all worth reading.

Step 3: Consult the ExamTopic Website

The ExamTopic website is a valuable resource for exam preparation. However, it’s essential to use it strategically. This resource is not a “cheat sheet” or a “shortcut” to the exam, so save it for later, like after we’ve refreshed our knowledge through reading the official documentation and best practices.

While ExamTopic can provide insight into potential exam questions, remember that there are no official answers: neither the answers offered on the site nor those voted up by users are guaranteed to be correct.

Get Ready for the Exam and Study Tips
  1. Exam Online or Onsite
    • There are two ways to take the exam: Online and Onsite. If you choose the online option, make sure your home Wi-Fi is stable and your system passes the checks (webcam, microphone, Secure Browser).
    • You will be asked to adjust your device’s security settings, such as turning off the Firewall or enabling screen sharing. If you’re not comfortable making these changes, consider booking an Onsite Exam.
    • If any issues arise during the exam, don’t panic! Just contact the Kryterion support team through Live Chat. They can help with things like reopening the launch button for you or adjusting the time.
    • The key is to stay calm and reach out for help if needed to ensure a smooth exam experience!
  2. Reading vs. Watching
    • In the age of abundant online resources, it’s tempting to jump straight into video tutorials and courses. However, for the best retention of knowledge, start by actively reading Google’s documentation.
    • Passive learning through watching videos may cause you to miss details. Reading engages your mind and helps you absorb information effectively.
  3. Understand Trade-offs
    • Machine learning involves making critical decisions, such as balancing speed and accuracy. Take the time to understand the trade-offs involved in various ML solutions. This understanding will prove invaluable not only in the exam but also in real-world ML projects.
  4. Reading Comprehension
    • During the exam, we will encounter questions that provide background information on a problem, stakeholder expectations, and resource limitations. Treat these questions like reading comprehension exercises, as key details hidden within can guide us to the correct answer. Pay close attention to keywords that may hold the solution.
  5. Time Management
    • The exam requires answering 60 questions within a limited timeframe (currently about 2 hours, which may change in the future). Manage your time wisely by marking questions you’re unsure about for review later.
    • Prioritize the questions you can answer confidently first, and revisit the marked ones before submitting the exam.
  6. Stress Management
    • Even if you tell yourself not to stress, it’s natural to feel some pressure during the exam.
    • Consider taking simulated practice exams to steady your nerves, especially if you haven’t taken an exam in a long time. This practice can improve your mental preparedness for the actual exam.

In the end, I wish you the best of luck in your journey towards achieving the Google Cloud Professional Machine Learning Engineer certification. Remember that diligent preparation, careful reading, and a strategic approach to resources can significantly enhance your chances of success.

Stay confident, stay focused, and may you pass the exam as soon as possible!

-END-


Deep Learning Recommender System – Part 1: Technical Framework

[2023-06 Update Review]: add more details about related terms for better understanding.

In this blog, I will review the Classic Technical Framework of the modern (Deep Learning) recommendation system (aka. recommender system).

Before Starting

Before I start, I want to ask readers a question:

What is the first thing you want to do when you start learning a new field X?

Of course, everyone has their own answers, but for me, there are two questions I want to know the most at the beginning. Here X is the recommender system.

What problem is this X trying to solve?

Is there a high-level mind map so that I can understand the basic concepts, main technologies and development requirements in this X?

Moreover, for the field of “deep learning recommendation system”, there may be a third question.

Why do people keep emphasizing “deep learning“, and what revolutionary impact does deep learning bring to the recommendation system?

I hope you will find answers to these three questions after reading this blog.

What is the fundamental problem to be solved by a recommender system?

Recommender systems have worked their way into all aspects of life, such as shopping, entertainment, and learning. Although recommendation scenarios such as product recommendation, video recommendation, and news recommendation may look completely different, since they are all called “recommender systems”, the essential problem they solve must be the same and follow a common logical framework.

The problem to be solved by the recommender system can be summed up in one sentence:

In the information overload era, how can Users efficiently obtain the Items of their interest?

Therefore, the recommendation system is a bridge built between “the overloaded Internet information” and “users’ interests“.

Let’s look at the recommender system’s abstract logical architecture, and then build its technical architecture step by step, so you can have an overall impression of the entire system.

The logical architecture of the recommender system

Starting from this fundamental problem, we can clearly see that a recommender system deals with the relationship between “people” and “information”: based on the two, it constructs a method for finding the information that interests each person.

  • User – From the perspective of “people”, to infer a person’s interests more reliably, the recommender system draws on a large amount of information related to that person, including historical behaviour, demographic attributes, relationship networks, etc. These are collectively referred to as “User Information”.
  • Item – The definition of “information” varies by scenario: it refers to “product information” in product recommendation, “video information” in video recommendation, and “news information” in news recommendation. We can collectively call it “Item Information” for convenience.
  • Context – In addition, in a specific recommendation scenario, the user’s final choice is generally affected by a series of environmental factors such as time, location, and user state, which can be called “scene information” or “context information”.

With these definitions, the problem to be dealt with by the recommender system can be formally defined as:

For a certain user U (User), in a specific context C (Context), build a function over the massive set of “item” information that predicts the user’s preference for each specific candidate item I (Item), then rank all candidate items by predicted preference to generate a recommendation list.
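In code, this definition boils down to “score every candidate with f(U, I, C), then sort”. A minimal sketch with hypothetical names (any real system would replace score_fn with a trained model):

```python
# Minimal sketch of the recommender's logical core: score, then rank.
from typing import Any, Callable, List

def recommend(user: Any,               # "user information" U
              context: Any,            # "context information" C
              candidates: List[Any],   # candidate "item information" I
              score_fn: Callable[[Any, Any, Any], float],  # the model f(U, I, C)
              k: int = 10) -> List[Any]:
    """Return the top-k candidate items by predicted preference."""
    scored = [(score_fn(user, item, context), item) for item in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:k]]

# Usage with a trivial stand-in scoring function:
top = recommend("user_42", {"hour": 20}, ["item_a", "item_b", "item_c"],
                score_fn=lambda u, i, c: len(i), k=2)
print(top)
```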

In this way, we can abstract the logical framework of the recommender system, as shown in Figure 1. Although this logical framework is relatively simple, it is on this simple basis that we can refine and expand each module to produce the entire technical system.

Figure 1. The logical architecture diagram of the recommender system: the candidate items, the recommendation model over User, Item, and Context, and the final recommendation list.

The Revolution of Deep Learning for Recommender Systems

With the logical architecture of the recommender system (Figure 1), we can answer the third question from the beginning:

What revolutionary impact does deep learning bring to the recommender system?

At the centre of the logical architecture diagram is an abstract function f(U, I, C), which is responsible for “guessing” what the user wants and “scoring” the items the user may be interested in, so as to produce the final list of recommended items. In recommender systems, this function is generally referred to as the “recommendation system model” (hereinafter, the “recommendation model”).

The application of deep learning to recommendation systems can greatly enhance the fitting and expressive capabilities of recommendation models. Simply put, deep learning aims to make the recommendation model “guess more accurately” and better capture the “heart” of users.

If this still feels abstract, let’s compare traditional machine learning recommendation models with deep learning ones from the perspective of model structure, to build a clearer picture.

Here is a model structure comparison chart in Figure 2, which compares the difference between the traditional Matrix Factorization model and the Deep Learning Matrix Factorization model.

Figure 2. Traditional Matrix Factorization model vs Deep Learning Matrix Factorization model (Neural collaborative filtering). Source: https://arxiv.org/abs/1708.05031

Let’s ignore the details for now. How do you feel at first glance?

Do you feel that the deep learning model has become more complex, layer after layer, and the number of layers has increased a lot?
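To make the structural difference concrete, here is a hedged numpy sketch: traditional matrix factorization scores a user-item pair with a fixed dot product, while a neural-CF-style model feeds the two embeddings through an MLP so the interaction function itself is learned (layer sizes and random weights here are purely illustrative, not the paper’s configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8
user_vec = rng.normal(size=DIM)  # learned user embedding (illustrative)
item_vec = rng.normal(size=DIM)  # learned item embedding (illustrative)

# Traditional matrix factorization: the match function is a fixed dot product.
mf_score = float(user_vec @ item_vec)

# Neural CF style: concatenate the embeddings and learn the interaction
# with a small MLP (weights would normally be trained, not random).
W1 = rng.normal(size=(16, 2 * DIM))
b1 = np.zeros(16)
w2 = rng.normal(size=16)
hidden = np.maximum(0.0, W1 @ np.concatenate([user_vec, item_vec]) + b1)  # ReLU
ncf_score = float(w2 @ hidden)

print(f"MF score: {mf_score:.3f}, NCF score: {ncf_score:.3f}")
```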

In addition, the flexible model structure of the deep learning model also gives it an irreplaceable advantage; that is, we can let the neural network of the deep learning model simulate the changing process of many user interests and even the thinking process of the user making a decision.

For example, Alibaba’s deep learning model, the Deep Interest Evolution Network (Figure 3, DIEN), uses a three-layer sequence-model structure to simulate how users’ interests evolve when purchasing goods; traditional machine learning models lack this kind of capacity to fit data and interpret user behaviour.

Figure 3. Alibaba’s Deep Interest Evolution Network (DIEN) for Click-Through Rate (CTR) Prediction. AUGRU (GRU with attentional update gate) models the interest-evolving process that is relative to the target item. GRU denotes Gated Recurrent Units, which overcomes the vanishing gradients problem of RNN and is faster than LSTM. Source: https://arxiv.org/abs/1809.03672.

Moreover, the revolutionary impact of deep learning on recommender systems goes far beyond the model itself. In recent years, as the structural complexity of deep learning models has grown, more and more data streams must converge for model training, testing and serving, and the data storage, processing and updating modules of recommender systems on cloud computing platforms have also entered the “deep learning era”.

After talking so much about the impact of deep learning on recommender systems, we seem to have not seen a complete deep learning recommender system architecture. Don’t worry. Let’s talk about what the technical architecture of a classic deep-learning recommender system looks like in the next section.

Technical Architecture of Deep Learning Recommendation System

The architecture of the deep learning recommender system follows the same lines as the classic one; it upgrades specific modules of the classic architecture to enable and support deep learning. So I will first describe the classic recommender system architecture, and then the improvements deep learning brings.

In the actual recommender system, there are two types of problems that engineers need to focus on in projects.

Type 1 is about data and information. What are “user information”, “item information”, and “scenario information”? How to store, update and process these data?

Type 2 is on recommendation algorithms and models. How to train, predict, and achieve better recommendation results for the system?

An industrial recommendation system’s technical architecture is based on these two parts.

  • The “data and information” part has gradually developed into a data flow framework that integrates offline (nearline) batch processing of data and real-time stream processing in the recommender system.
  • The “model and algorithm” part is further refined into a model framework that integrates training, evaluation, deployment, and online inference in the recommender system.

Based on this, we can summarize the technical architecture diagram of the recommender system as in Figure 4.

Figure 4. Diagram of the technical architecture of the recommendation system

In Figure 4, I divided the technical architecture of the recommender system into a “data part” and a “model part”.

Part 1: Data Framework

The “data part” of the recommender system is mainly responsible for collecting and processing “user”, “item” and “context” information. Depending on the data volume and the real-time requirements, we use three different data processing methods, ordered here by real-time performance:

  1. Client and Server end-to-end real-time data processing.
  2. Real-time stream data processing.
  3. Big data offline processing.

From 1 to 3, the real-time performance of these methods decreases from strong to weak, and the massive data processing capabilities of the three methods increase from weak to strong.

Method                                              | Real-Time Performance | Data Processing Capability | Possible Solutions
Client and Server end-to-end real-time data processing | Strong             | Weak                       | -
Real-time stream data processing                    | Medium                | Medium                     | Flink
Big data offline processing                         | Weak                  | Strong                     | Spark

Therefore, the data pipeline of a mature recommender system uses all three methods together, and they complement each other.

The big data computing platform (e.g., AWS, Azure, GCP, etc.) of the recommender system can extract training data, feature data, and statistical data through the processing of the system logs and metadata of items and users. So what are these data for?

Specifically, there are three downstream applications based on the data exported from the data platform:

  1. Generate the Sample Data required by the recommender system model for the training and evaluation of the algorithm model.
  2. Generate the “user features”, “item features” and some of the “context features” required by the recommendation model service (Model Serving) for online inference.
  3. Generate statistical data required for System Monitoring and Business Intelligence (BI) systems.

The data part is the “water source” of the entire recommender system. Only by ensuring the continuity and purity of the “water source” can we continuously “nourish” the recommender system so that it can operate efficiently and output accurately.

In the deep learning era, models place higher demands on the “water source”. First, the volume of water must be large: only then can we ensure that the deep learning models we build converge as quickly as possible. Second, the “water flow” must be fast, so that data reaches the system modules for model updates and adjustments in time and the model can track changes in user interest in real time. This is the same force driving the rapid development and adoption of the big data engine (Spark) and the stream computing platform (Flink).

Part II: Model Framework

The “model part” is the main body of the recommender system. The model generally consists of a “recall layer”, a “ranking layer”, and a “supplementary (auxiliary) strategy and algorithm layer”; a minimal pipeline sketch follows the list below.

  • The “recall layer” is generally composed of efficient recall rules, algorithms or simple models, which let the recommender system quickly recall, from a massive candidate set, the items a user may be interested in.
  • The “ranking layer” uses ranking models to finely rank the candidate items initially screened by the recall layer.
  • The “supplementary strategy and algorithm layer”, also known as the “re-ranking layer”, applies supplementary strategies and algorithms that account for indicators such as “diversity”, “popularity” and “freshness”, making final adjustments to the item list before returning the user-visible recommendation list.
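Here is that sketch; each layer below is a trivial stand-in for the rules or models a production system would use (all names and heuristics are hypothetical):

```python
# Sketch of the recall -> ranking -> re-ranking serving pipeline.
from typing import Dict, List

Item = Dict[str, object]

def recall_layer(candidates: List[Item], followed: set, n: int = 100) -> List[Item]:
    # Fast rule: keep only items in categories the user follows.
    return [it for it in candidates if it["category"] in followed][:n]

def ranking_layer(items: List[Item]) -> List[Item]:
    # A trained model would score f(U, I, C); here, a dummy popularity score.
    return sorted(items, key=lambda it: it["clicks"], reverse=True)

def rerank_layer(items: List[Item], k: int = 10) -> List[Item]:
    # Promote diversity: at most two items per category in the final list.
    counts, result = {}, []
    for it in items:
        c = it["category"]
        if counts.get(c, 0) < 2:
            result.append(it)
            counts[c] = counts.get(c, 0) + 1
    return result[:k]

candidates = [{"id": i, "category": c, "clicks": (i * 7) % 13}
              for i, c in enumerate(["tech", "sports", "news"] * 5)]
final = rerank_layer(ranking_layer(recall_layer(candidates, {"tech", "sports"})))
print([it["id"] for it in final])
```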

The “model serving process” means the recommender system model receives a full candidate item set and then generates the recommendation list.

To optimise the parameters required by the model serving process, we use model training to determine the model structure, the specific weight values within that structure, and the parameter values of the related algorithms and strategies.

The training methods can be divided into “offline training” and “online updating” according to different environments.

  • The advantage of offline training is that the optimizer can use the full set of samples and features, letting the model approach globally optimal performance.
  • Online updating, by contrast, can “digest” new data samples in real time, learn and reflect new data trends more quickly, and meet real-time recommendation requirements.

In addition, in order to evaluate the performance of the recommender system model and optimize the model iteratively, the model part of the recommender system also includes various evaluation modules such as “offline evaluation” and “online A/B test“, which are used to obtain offline and online indicators to guide the model iteration and optimization.

We just said that the revolution of deep learning for recommender systems is in the model part, so what are the specifics?

I summarized the most typical deep learning applications into 3 points:

  1. Embedding technology in the recall layer. Deep learning embedding is already the industry’s mainstream solution for letting the recall layer quickly generate candidates relevant to the user (see the sketch after this list).
  2. Deep learning models of different structures in the ranking layer. The ranking layer (also called the fine-ranking layer) has the greatest influence on the system’s performance, and it is where deep learning models show their strengths: their flexibility and expressive power make them well suited to accurate ranking over large data volumes. Unsurprisingly, deep learning ranking models are a hot topic in both industry and academia, and they keep attracting investment and rapid iteration from researchers and engineers.
  3. Reinforcement learning in model updating and integration (CI/CD). Reinforcement learning is a field of machine learning closely related to deep learning; applying it to recommender systems pushes them toward a higher level of real-time performance.
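As a concrete illustration of point 1, here is a minimal sketch of embedding-based recall: users and items are embedded in the same vector space, and recall is a nearest-neighbour search (random vectors stand in for trained embeddings; at production scale an approximate nearest-neighbour index would replace the brute-force search):

```python
import numpy as np

rng = np.random.default_rng(42)
N_ITEMS, DIM = 10_000, 64
item_embs = rng.normal(size=(N_ITEMS, DIM))
item_embs /= np.linalg.norm(item_embs, axis=1, keepdims=True)  # unit vectors

def recall_top_k(user_emb: np.ndarray, k: int = 100) -> np.ndarray:
    """Brute-force cosine-similarity recall over all items."""
    user_emb = user_emb / np.linalg.norm(user_emb)
    scores = item_embs @ user_emb          # cosine similarity on unit vectors
    return np.argsort(scores)[::-1][:k]    # indices of the k most similar items

user_emb = rng.normal(size=DIM)  # would come from a trained user tower
print(recall_top_k(user_emb, k=5))
```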

Summary

In this blog, I reviewed the technical architecture of the deep learning recommendation system. Although it involves a lot of content, you don’t have to worry about it if you cannot remember all the details. All you need is to keep the impression of this framework in your mind.

You can use the content of this framework as a technical index of a recommender system, making it your own knowledge map. Visually speaking, you can think of the content of this blog as a tree of knowledge, which has roots, stems, branches, leaves, and flowers.

Let’s recall the most important concepts again:

  • The root is that the recommender system aims to solve the challenge of how to help users efficiently obtain the items of interest in this “information overload” era.
  • The stems are the logical architecture of the recommender system: for a user U (User), in a specific context C (Context), a function is constructed over a large number of “items” (products, videos, etc.) to predict the user’s degree of preference for a specific candidate item I (Item).
  • The branches and leaves are the technical modules of the recommender system and the algorithms/models of each module, respectively. The technical module is responsible for supporting the system’s technical architecture. Using algorithms and models, we can create diverse functions within the system and provide accurate results.

Finally, the application of deep learning is undoubtedly the pearl of the current technical architecture of recommender systems. It is like a flower blooming on this big tree, and it is the most wonderful finishing touch.

The structure of a deep learning model is complex, but its data-fitting and expressive abilities are stronger, so the recommendation model can better simulate users’ interest changes and even their decision-making processes. The development of deep learning has also driven a revolution in the data flow framework, placing higher demands on cloud computing platforms to process data faster and at greater scale.

I hope you like this blog. To be continued: the next blog will be Deep Learning Recommender System Part 2: Feature Engineering.

One more thing…

Figure 5 shows the recommender system of Netflix. Here is a challenge: using the technical framework discussed in this blog, can you tell which parts of the diagram belong to the data part and which belong to the model part?

Figure 5. Diagram of Netflix recommender system architecture

Risk Level Calculation for Contact Tracing: An Example of the Apple iOS Framework

You may know that in Australia there is a ‘COVIDSafe app’ for everyone.


The COVIDSafe app speeds up contacting people exposed to coronavirus (COVID-19). This helps us support and protect you, your friends and family. Please read the content on this page before downloading.

At the end of the Australian COVID-19 pandemic, users will be prompted to delete the COVIDSafe app from their phone. This will delete all app information on a person’s phone. The information contained in the information storage system will also be destroyed at the end of the pandemic.

Here is the introduction video:

So, all those descriptions are trying to tell you that your information is safe and your privacy is protected with this app. By the way, the COVIDSafe app is the only contact-tracing app approved by the Australian Government, which I take to mean it is the first official one.

This post is for readers who want a slightly deeper technical understanding of the technology used in this app. I will quote Apple’s documentation and keep things as simple as possible. I am not an iOS developer; I am just as curious as you, trying to understand how it measures risk. And I am not sure whether the COVIDSafe app actually uses Apple’s framework, LOL~

My only sources are the web pages of the Australian Government Department of Health and Apple’s [iOS Framework Document] Exposure Notification (April 2020). You can look up these keywords to learn more background about this app: COVIDSafe, Mesh Network, GDPR, DP3T, Beacon.

So, according to Apple’s document, the following diagram illustrates the general format of Exposure Risk Level Calculation:

Figure: Exposure Risk Level Calculation (diagram from Apple’s Exposure Notification framework document).

Exposure Risk Level Parameters

  • Transmission Risk – An app-defined, flexible value used to tag a specific positive key. This value can be assigned based on symptoms, level of diagnosis verification, or other determinations from the app or health authority.
  • Duration (measured by the API) – Cumulative duration of the exposure.
  • Days (measured by the API) – Days since the exposure incident.
  • Attenuation (measured by the API) – Minimum Bluetooth signal-strength attenuation (transmission power minus RSSI).
  • Level Value – The value, ranging from 1 to 8, that the app assigns to each level of each Exposure Risk Level Parameter.
  • Level – The eight levels contained within each Exposure Risk Level Parameter.

Exposure Risk Level Parameter Weights (A, B, C, D)

  • The weights defined by the app (ranging from 0-100) that assign the relative importance to each of the Exposure Risk Level Parameters.
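To make the general format concrete, here is a rough sketch of how such a calculation could look. This is my illustrative reading of the document’s structure, not Apple’s exact production formula; every table and the multiplicative combination below are assumptions:

```python
# Hedged sketch of the Exposure Risk Level Calculation format described
# above: each measured parameter is bucketed into one of 8 levels, the app
# assigns a level value (1-8) per level, and app-defined weights (0-100)
# set relative importance. Combining by weighted product is an assumption.

def bucket_level(value: float, thresholds: list) -> int:
    """Map a raw measurement to a level index 0..7 using 7 thresholds."""
    return sum(value >= t for t in thresholds)

# Hypothetical app-defined tables, one row per Exposure Risk Level Parameter.
LEVEL_VALUES = {
    "transmission": [1, 2, 3, 4, 5, 6, 7, 8],
    "duration":     [1, 1, 2, 3, 4, 5, 7, 8],  # longer exposure = riskier
    "days":         [8, 7, 6, 5, 4, 3, 2, 1],  # more recent = riskier
    "attenuation":  [8, 7, 6, 5, 4, 3, 2, 1],  # lower attenuation = closer
}
WEIGHTS = {"transmission": 50, "duration": 50, "days": 50, "attenuation": 50}
THRESHOLDS = {
    "transmission": [1, 2, 3, 4, 5, 6, 7],        # app-defined risk tags
    "duration":     [5, 10, 15, 20, 25, 30, 35],  # minutes
    "days":         [2, 4, 6, 8, 10, 12, 14],     # days since exposure
    "attenuation":  [10, 20, 30, 40, 50, 60, 70], # dB
}

def risk_score(measurements: dict) -> float:
    score = 1.0
    for name, raw in measurements.items():
        level = bucket_level(raw, THRESHOLDS[name])
        score *= LEVEL_VALUES[name][level] * (WEIGHTS[name] / 100.0)
    return score

print(risk_score({"transmission": 5, "duration": 12, "days": 3, "attenuation": 25}))
```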

 


How to Build an Artificial Intelligent System (II)

This post is a follow-up to the earlier post How to Build an Artificial Intelligent System (I). That post focused on introducing the six phases of building an intelligent system and explaining the details of the Problem Assessment phase.

In the following content, I will address the remaining phases and the key steps of the building process. Readers can download the keynote here: Building an Intelligent System with Machine Learning.


How to Build an Artificial Intelligent System (I)

Phase 1: Problem assessment – Determine the problem’s characteristics.

What is an intelligent system?

The process of building an intelligent knowledge-based system has been called knowledge engineering since the 1980s. It usually comprises six phases: 1. Problem assessment; 2. Data and knowledge acquisition; 3. Development of a prototype system; 4. Development of a complete system; 5. Evaluation and revision of the system; 6. Integration and maintenance of the system [1].


Learning From Data and Three Principles

The learning problem and the principles before building a model.

This blog is mainly based on the book and lecture notes on Learning from Data by Professor Yaser S. Abu-Mostafa of Caltech; you can benefit greatly from the lectures and videos.

  1. Learning Problem
  2. Program and ML Toolbox
  3. Learning Principles

Learning Problem

“In God we trust, and others bring data”.

If you show a picture to a three-year-old and ask if there is a tree, you will likely get the correct answer. But if you ask a thirty-year-old what the definition of a tree is, you will likely get an inconclusive answer.

We didn’t learn what a tree is by studying the mathematical definition of a tree. We learned it by looking at trees. In other words, we learn from ‘Data’.
