Definitive Guide To Reinforcement Learning

Reinforcement learning is one of the most popular branches of machine learning, in which an agent learns to behave in an environment by performing actions and analysing the results of those actions. The model is trained to make a sequence of decisions, and the agent learns to achieve a goal in an uncertain and potentially complex environment.

It is currently one of the hottest topics in artificial intelligence, and we are seeing a lot of progress in this intriguing area of research. Examples:

  1. DeepMind and the Deep Q-learning architecture
  2. AlphaGo beating the champion of the game of Go
  3. OpenAI and PPO (Proximal Policy Optimization)

Understanding Reinforcement Learning With an Example (The Game of PacMan)

In reinforcement learning, an artificial intelligence faces a game-like situation. The computer uses trial and error to come up with a solution to the problem. To get the machine to do what the programmer actually wants, the artificial intelligence is given either rewards or penalties depending on the actions it performs. Its ultimate goal is to maximize the total reward.

Let’s take the game of PacMan as an example. The agent, PacMan, has to eat the food in the grid while avoiding the ghosts on its way. The grid world is the interactive environment set up for the agent. PacMan receives a reward for eating food and a penalty if it gets killed by a ghost. The states are PacMan’s locations in the grid world, and the total cumulative reward corresponds to PacMan winning the game.
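
The interaction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 3x3 grid, the positions of the food and the ghost, the reward values of +1 and -1, and the function name step are all made up for this example and are not taken from the actual game.

```python
# Hypothetical 3x3 grid world standing in for the PacMan maze.
GRID_SIZE = 3
FOOD = (0, 2)    # assumed food position (row, col)
GHOST = (2, 2)   # assumed ghost position (row, col)
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def step(state, action):
    """Environment step: take (state, action), return (next_state, reward, done)."""
    row, col = state
    d_row, d_col = MOVES[action]
    # Clamp to the grid so PacMan cannot walk through the outer walls.
    row = min(max(row + d_row, 0), GRID_SIZE - 1)
    col = min(max(col + d_col, 0), GRID_SIZE - 1)
    next_state = (row, col)
    if next_state == FOOD:
        return next_state, 1.0, True    # reward for eating the food
    if next_state == GHOST:
        return next_state, -1.0, True   # penalty for being caught
    return next_state, 0.0, False       # ordinary move, the game goes on


state, total_reward, done = (0, 0), 0.0, False
for action in ["right", "right"]:
    state, reward, done = step(state, action)
    total_reward += reward
print(state, total_reward, done)  # prints: (0, 2) 1.0 True
```

Here the state is simply PacMan’s cell, and the cumulative reward is the sum of everything step returns during one game.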

Reinforcement Learning Fig3 Pacman

By leveraging the power of search and many trials, reinforcement learning has become one of the most effective ways to elicit creative behavior from a machine. If a reinforcement learning algorithm is run on sufficiently powerful computing infrastructure, the artificial intelligence can gather experience from thousands of parallel gameplays.

6 Key Terms Related To Reinforcement Learning

  1. Environment:
    It is the world, physical or simulated, in which the agent moves and operates, and which responds to the agent. It takes the agent’s current state and action as input, and returns the agent’s reward and next state as output.

  2. State:
    It is the immediate situation of the agent. It can be technically seen as an instantaneous configuration that puts the agent in relation to other significant things such as prizes, enemies, tools and obstacles. 

  3. Reward:
    It is the feedback from the environment that measures the success or failure of an agent’s actions in a particular state. Rewards can be immediate or delayed, and they are used to evaluate the agent’s actions.

  4. Policy:
    It is a mapping from an agent’s states to actions, ideally the actions that promise the highest reward. It can also be understood as the strategy that an agent employs to determine the next action based on the current state.

  5. Discount Factor:
    A discount factor is multiplied by future rewards in order to dampen their effect on the agent’s choice of action. Immediate rewards therefore carry more weight than future ones, which enforces a kind of short-term hedonism in the agent. A discount factor of 1 would make future rewards worth just as much as immediate rewards.

  6. Value:
    It is the total future reward that an agent can expect to receive by taking an action in a particular state. It can be seen as the expected long-term return with a discount.
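
To make the discount factor and the idea of a discounted return concrete, here is a minimal sketch. The symbol gamma for the discount factor is the conventional one rather than something defined in this post, and the reward sequence is a hypothetical episode in which food is eaten on the third move.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, each weighted by gamma raised to its delay in steps."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


rewards = [0.0, 0.0, 1.0]  # hypothetical: the only reward arrives on move 3
for gamma in (0.0, 0.5, 1.0):
    print(gamma, discounted_return(rewards, gamma))
# prints:
# 0.0 0.0
# 0.5 0.25
# 1.0 1.0
```

With a discount factor of 0 the delayed reward is ignored entirely, with 0.5 it is worth a quarter of its face value two steps ahead, and with 1 it counts in full.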

Reinforcement Learning Fig. 1

Tasks in Reinforcement Learning

A task is said to be an instance of a Reinforcement Learning problem. We have two types of tasks:
1. Episodic Task
2. Continuous Task

1. Episodic task

An episode is created when we have a starting point and an ending point (a terminal state). An episodic task lasts for a finite amount of time and has at least one terminal state. In an episodic task there may be only a single reward, given at the end of the task, and one option is to distribute that reward evenly across all actions taken in the episode.
Playing a single game of Go, which you either win or lose, is an example of an episodic task.
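
The option of spreading a single end-of-episode reward evenly over the actions can be sketched as follows. The function name, the move labels, and the final reward of 1.0 for a win are hypothetical and chosen only for illustration.

```python
def spread_terminal_reward(actions, final_reward):
    """Give every action in the episode an equal share of the final reward."""
    if not actions:
        return []
    share = final_reward / len(actions)
    return [(action, share) for action in actions]


episode_actions = ["move_a", "move_b", "move_c", "move_d"]  # hypothetical moves
credited = spread_terminal_reward(episode_actions, final_reward=1.0)
print(credited)
```

Each of the four moves is credited with 0.25 here. This is the simplest possible scheme, and it treats good and bad moves within a winning game alike.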

2. Continuous tasks

A continuous task is a task that never ends. It has no terminal state, so it continues forever. The reward cannot be given at the end, as there is no end, so it is given every so often during the task. In a continuous task, rewards may be assigned along with discounting, so that more recent actions receive a greater share of the reward.
Reading the internet to learn mathematics could be considered an example of a continuous task.
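
One reason discounting matters for a task with no end is that an undiscounted reward total can grow without bound, while a discount factor below 1 keeps it finite. The sketch below assumes a constant reward of 1.0 at every step and a discount factor of 0.9, both picked purely for illustration; the limit it compares against is the standard geometric series sum.

```python
def discounted_sum(reward_per_step, gamma, steps):
    """Discounted total of a constant reward received for a number of steps."""
    total, weight = 0.0, 1.0
    for _ in range(steps):
        total += weight * reward_per_step
        weight *= gamma  # each further step counts a little less
    return total


gamma = 0.9
for steps in (10, 100, 1000):
    print(steps, round(discounted_sum(1.0, gamma, steps), 4))

# Geometric series limit: reward / (1 - gamma), approximately 10 here.
print("limit", 1.0 / (1.0 - gamma))
```

However long the task runs, the discounted total approaches the limit and never exceeds it.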

How Does a Reinforcement Learning Agent Learn?

There are two ways of learning in reinforcement learning:

  1. Monte Carlo:
    Collect the rewards at the end of the episode and then calculate the maximum expected future reward.

  2. Temporal Difference Learning:
    Estimate the rewards at each step and take action based on that estimate, without waiting for the episode to end.

Monte Carlo

In the Monte Carlo approach, when the episode ends, the agent looks at the total cumulative reward to see how well it did. The total is only known at the end of the game, so that is when the agent updates its estimates. Each game adds to the knowledge of the agent, and it makes better decisions with each iteration.
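
A minimal sketch of this idea is the constant step-size, every-visit Monte Carlo update shown below. It is one common variant and not necessarily the one the post has in mind. The step size alpha, the discount factor gamma, the state labels, and the episode format of (state, reward) pairs are all assumptions made for the example.

```python
from collections import defaultdict


def monte_carlo_update(values, episode, gamma=0.9, alpha=0.1):
    """Update state values from one finished episode.

    episode is a list of (state, reward) pairs, where reward is what the
    agent received after acting in that state.
    """
    episode_return = 0.0
    # Walk backwards so the return from each state builds up incrementally.
    for state, reward in reversed(episode):
        episode_return = reward + gamma * episode_return
        # Move the old estimate a small step toward the observed return.
        values[state] += alpha * (episode_return - values[state])
    return values


values = defaultdict(float)  # every state starts with a value of 0.0
episode = [("A", 0.0), ("B", 0.0), ("C", 1.0)]  # hypothetical finished game
monte_carlo_update(values, episode)
print(dict(values))
```

Nothing is updated until the whole episode is available, which is exactly what distinguishes this approach from temporal difference learning.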

Temporal Difference Learning 

TD Learning does not wait until the end of the episode to update its estimate of the maximum expected future reward. Instead, it updates its value estimate V for each non-terminal state as soon as that state is experienced, that is, after every step.
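
The simplest form of this is usually called TD(0), and a single update might look like the sketch below. The step size alpha, the discount factor gamma, the state labels, and the starting value of 0.5 for state B are assumptions made for illustration.

```python
def td0_update(values, state, reward, next_state, done, gamma=0.9, alpha=0.1):
    """Apply one TD(0) update right after a single step."""
    # A terminal state has no future, so its bootstrap value is zero.
    bootstrap = 0.0 if done else values.get(next_state, 0.0)
    td_target = reward + gamma * bootstrap
    old_value = values.get(state, 0.0)
    # Move the old estimate a small step toward the one-step target.
    values[state] = old_value + alpha * (td_target - old_value)
    return values[state]


values = {"B": 0.5}  # hypothetical current estimate for state B
td0_update(values, state="A", reward=0.0, next_state="B", done=False)
print(values)
```

The value of A moves toward the reward plus the discounted estimate for B, so the agent learns from the step immediately instead of waiting for the final outcome.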

Types of Reinforcement Learning

There are two types of Reinforcement:

  1. Positive
    Positive reinforcement occurs when an event that follows a particular behavior increases the strength and frequency of that behavior. It has a positive effect on behavior: it maximizes performance and sustains the change for a long period of time.

  2. Negative
    Negative reinforcement is the strengthening of a behavior because a negative condition is stopped or avoided. It increases the behavior and helps enforce a minimum standard of performance.

What Challenges Do We Face in Reinforcement Learning?

The first challenge in reinforcement learning lies in preparing the simulation environment, which depends heavily on the task to be performed. Things become even more difficult when the model has to be transferred out of the training environment and into the real world.

The next major challenge is scaling and tweaking the neural network that controls the agent. The only way to communicate with the network is through the system of rewards and penalties.

Finally, some agents will optimize for the reward without performing the task they were designed for. This is also a major challenge in reinforcement learning.


Reinforcement learning consists of goal-oriented algorithms that learn how to attain a complex goal. They can start from a blank slate and, under the right conditions, attain superhuman performance. Reinforcement learning algorithms that incorporate deep neural networks can even beat human experts in various games. The potential of reinforcement learning is immense, and it may prove to be a step toward true artificial intelligence. You can also check out our post on 8 Neural Network Architectures Machine Learning Researchers Need to Learn.
