
Editor’s note: This is an edited extract from AI
Crash Course, by Hadelin de Ponteves, published by Packt. Find out more and
buy a copy of the book by visiting here.
When people refer to AI today, some of them think of Machine
Learning, while others think of Reinforcement Learning. I fall into the second
category. I always saw Machine Learning as statistical models that have the
ability to learn some correlations, from which they make predictions without
being explicitly programmed. While this is, in some way, a form of AI, Machine
Learning does not include the process of taking actions and interacting with an
environment like we humans do. Indeed, as intelligent human beings, we constantly do the following:
- We observe some input, whether it’s what we see with our eyes, what we hear with our ears, or what we remember in our memory
- These inputs are then processed in our brain
- Eventually, we make decisions and take actions.
This process of interacting with an environment is what we
are trying to reproduce in terms of Artificial Intelligence. And to that
extent, the branch of AI that works on this is Reinforcement Learning. It is the closest match to the way we think, and arguably the most advanced form of Artificial Intelligence, if we see AI as the science that tries to mimic (or surpass) human intelligence.
Reinforcement Learning also boasts some of the most impressive results in business applications of AI. For example, Alibaba leveraged Reinforcement Learning to increase its ROI in online advertising by 240% without increasing its advertising budget (see https://arxiv.org/pdf/1802.09756.pdf, page 9, Table 1, last row (DCMAB)).
The five principles of reinforcement learning
Let’s begin building the first pillars of your intuition for how Reinforcement Learning works. These are the fundamental principles of Reinforcement Learning, which will give you a solid foundation in AI.
Here are the five principles:
- Principle #1: The input and output system
- Principle #2: The reward
- Principle #3: The AI environment
- Principle #4: The Markov decision process
- Principle #5: Training and inference
Principle #1 – The input and output system
The first step is to understand that today, all AI models are based on the common principle of inputs and outputs. Every single form of Artificial Intelligence, including Machine Learning models, chatbots, recommender systems, robots, and of course Reinforcement Learning models, takes something as input and returns something else as output.

In Reinforcement Learning, these inputs and outputs have a
specific name: the input is called the state, or input state. The output is the
action performed by the AI. And in the middle, we have nothing other than a
function that takes a state as input and returns an action as output. That
function is called a policy. Remember the name, “policy,” because you
will often see it in AI literature.
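To make the idea concrete, here is a minimal sketch of a policy as a plain function mapping a state to an action; the state fields and action names are invented for illustration, not taken from any real system:

```python
# A policy is simply a function that takes a state and returns an action.
# The state fields and action names here are hypothetical.

def policy(state):
    """Map an input state to an output action."""
    # A trivial hand-written rule; a learned policy would replace this logic.
    if state["obstacle_ahead"]:
        return "brake"
    return "accelerate"

print(policy({"obstacle_ahead": True}))   # brake
print(policy({"obstacle_ahead": False}))  # accelerate
```

A learned policy replaces the hand-written rule with parameters tuned from experience, but the shape stays the same: state in, action out.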
As an example, consider a self-driving car. Try to imagine
what the input and output would be in that case.
The input would be what the embedded computer vision system
sees, and the output would be the next move of the car: accelerate, slow down,
turn left, turn right, or brake. Note that the output at any time (t) could
very well be several actions performed at the same time. For instance, the self-driving
car can accelerate while at the same time turning left. In the same way, the
input at each time (t) can be composed of several elements: mainly the image
observed by the computer vision system, but also some parameters of the car
such as the current speed, the amount of gas remaining in the tank, and so on.
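As a small sketch of this point, the state can bundle several elements and the policy can return several simultaneous actions; every name and threshold below is a made-up illustration:

```python
# Hypothetical illustration: a composite state (several input elements)
# and an output made of several simultaneous actions.

def driving_policy(state):
    """Return the set of actions to perform at this time step."""
    actions = set()
    if state["speed"] < state["speed_limit"]:
        actions.add("accelerate")
    if state["curve"] == "left":
        actions.add("turn_left")
    elif state["curve"] == "right":
        actions.add("turn_right")
    return actions

state = {"speed": 40, "speed_limit": 50, "curve": "left"}
print(sorted(driving_policy(state)))  # ['accelerate', 'turn_left']
```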
That’s the very first important principle in Artificial
Intelligence: it is an intelligent system (a policy) that takes some elements
as input, does its magic in the middle, and returns some actions to perform as
output. Remember that the inputs are also called the states. The next important
principle is the reward.
Principle #2 – The reward
Every AI has its performance measured by a reward system.
There’s nothing confusing about this; the reward is simply a metric that will
tell the AI how well it does over time.
The simplest example is a binary reward: 0 or 1. Imagine an
AI that has to guess an outcome. If the guess is right, the reward will be 1,
and if the guess is wrong, the reward will be 0. This could very well be the
reward system defined for an AI; it really can be as simple as that!
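A binary reward really is as simple as it sounds; here is a one-line sketch of it as a function (the coin-flip guesses are just an example):

```python
# A binary reward: 1 if the AI's guess matches the outcome, 0 otherwise.

def binary_reward(guess, outcome):
    return 1 if guess == outcome else 0

print(binary_reward("heads", "heads"))  # 1
print(binary_reward("heads", "tails"))  # 0
```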
A reward doesn’t have to be binary, however. It can be
continuous. Consider the famous game of Breakout:

Imagine an AI playing this game. Try to work out what the
reward would be in that case. It could simply be the score; more precisely, the
score would be the accumulated reward over time in one game, and the rewards
could be defined as the derivative of that score.
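To see what "the derivative of the score" means in practice, here is a sketch with made-up score values: the reward at each step is the change in the accumulated score, and summing the rewards recovers the final score.

```python
# Illustrative only: rewards as the step-by-step change (discrete derivative)
# of an accumulated score. The score values are invented.

scores = [0, 10, 10, 40, 70]  # accumulated score after each step
rewards = [b - a for a, b in zip(scores, scores[1:])]

print(rewards)       # [10, 0, 30, 30]
print(sum(rewards))  # 70 -- summing the rewards recovers the final score
```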
This is one of the many ways we could define a reward system for that game. Different AIs will have different reward structures; we will build five reward systems for five different real-world applications in this book.
With that in mind, remember this as well: the ultimate goal
of the AI will always be to maximize the accumulated reward over time.
Those are the first two basic, but fundamental, principles
of Artificial Intelligence as it exists today; the input and output system, and
the reward. The next thing to consider is the AI environment.
Principle #3 – The AI environment
The third principle is what we call an “AI
environment.” It is a very simple framework where you define three things
at each time (t):
- The input (the state)
- The output (the action)
- The reward (the performance metric)
For every AI based on Reinforcement Learning that is built today, we always define an environment composed of the preceding elements. It is, however, important to understand that a given AI environment can contain more than these three elements.
For example, if you are building an AI to beat a car racing
game, the environment will also contain the map and the gameplay of that game.
Or, in the example of a self-driving car, the environment will also contain all
the roads along which the AI is driving and the objects that surround those
roads. But what you will always find in common when building any AI are the three elements of state, action, and reward. The next principle, the Markov decision process, covers how they work in practice.
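The three elements can be wrapped up in a tiny environment sketch. The reset/step interface below loosely mirrors the common Gym convention, but the class itself and its dynamics are invented purely for illustration:

```python
# A toy environment exposing the three elements: state, action, reward.
# Everything here (the class, positions, goal) is a made-up example.

class NumberLineEnv:
    """The agent starts at position 0 and tries to reach position 3."""

    def __init__(self):
        self.position = 0

    def reset(self):
        """Start a new episode and return the initial state."""
        self.position = 0
        return self.position

    def step(self, action):
        """Apply an action; return (next state, reward, done flag)."""
        self.position += 1 if action == "right" else -1
        reward = 1 if self.position == 3 else 0
        done = self.position == 3
        return self.position, reward, done

env = NumberLineEnv()
state = env.reset()
print(env.step("right"))  # (1, 0, False)
```

The map, gameplay, or roads mentioned above would live inside `step`: they are the hidden machinery that turns an action into the next state and a reward.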
Principle #4 – The Markov decision process
The Markov decision process, or MDP, is simply a process
that models how the AI interacts with the environment over time. The process
starts at t = 0, and then, at each next iteration, meaning at t = 1, t
= 2, … t = n units of time (where the unit can be anything, for
example, 1 second), the AI follows the same format of transition:
- The AI observes the current state, sₜ
- The AI performs the action, aₜ
- The AI receives the reward, rₜ = R(sₜ, aₜ)
- The AI enters the following state, sₜ₊₁
The goal of the AI is always the same in Reinforcement Learning: to maximize the accumulated rewards over time, that is, the sum of all the rₜ = R(sₜ, aₜ) received at each transition.
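The transition loop can be sketched in a few lines of code. The policy, dynamics, and reward function below are toy placeholders invented for this sketch, not part of any real training algorithm:

```python
# The MDP loop: at each t the AI observes s_t, performs a_t,
# receives r_t = R(s_t, a_t), and enters s_{t+1}.
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def policy(state):
    # A random placeholder policy; training would improve it.
    return random.choice(["left", "right"])

def transition(state, action):
    # Toy dynamics and toy reward function R(s_t, a_t).
    next_state = state + (1 if action == "right" else -1)
    reward = 1 if action == "right" else 0
    return next_state, reward

state = 0
total_reward = 0
for t in range(5):
    action = policy(state)                     # the AI performs a_t
    state, reward = transition(state, action)  # receives r_t, enters s_{t+1}
    total_reward += reward                     # accumulate rewards over time

print(total_reward)
```

Maximizing `total_reward` over many such episodes is exactly the goal stated above.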
The following graphic will help you visualize and remember the MDP, the basis of Reinforcement Learning models:

Now four essential pillars are already shaping your
intuition of AI. Adding a last important one completes the foundation of your
understanding of AI. The last principle is training and inference; in training,
the AI learns, and in inference, it predicts.