Reinforcement learning is one of the most popular types of machine learning, where an agent (i.e. the model) learns to behave in an environment by performing actions and analyzing the results of those actions.
This domain is currently a hot topic, and we’ve seen a lot of improvements in this fascinating area of research. Examples:
- DeepMind and the Deep Q learning architecture in 2014
- beating the champion of the game of Go with AlphaGo in 2016
- OpenAI and the PPO in 2017
Reinforcement Learning is said to be the hope of true artificial intelligence. And it is rightly said so, because the potential that Reinforcement Learning possesses is immense.
Let’s imagine an agent learning to play Super Mario Bros as a working example. The Reinforcement Learning (RL) process can be modeled as a loop that works like this:
- Our Agent receives state S0 from the Environment (In our case we receive the first frame of our game (state) from Super Mario Bros (environment))
- Based on that state S0, agent takes an action A0 (our agent will move right)
- Environment transitions to a new state S1 (new frame)
- Environment gives some reward R1 to the agent (not dead: +1)
This RL loop outputs a sequence of state, action and reward.
The goal of the agent is to maximize the expected cumulative reward.
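The loop above can be sketched in a few lines of Python. The environment here is a hypothetical stand-in with a Gym-style `reset`/`step` interface (a real Mario task would use something like `gym-super-mario-bros`), and the agent just picks random actions:

```python
import random

# Hypothetical toy environment with a Gym-style interface, just to
# illustrate the loop (not a real Super Mario Bros environment).
class ToyEnv:
    def __init__(self, length=5):
        self.length = length   # reach position `length` to finish the level
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos        # initial state S0

    def step(self, action):
        self.pos += 1 if action == "right" else 0
        reward = 1             # not dead: +1
        done = self.pos >= self.length
        return self.pos, reward, done  # new state, reward, terminal flag

env = ToyEnv()
state = env.reset()            # agent receives state S0 from the environment
total_reward = 0
done = False
while not done:
    action = random.choice(["right", "jump"])  # agent takes an action A
    state, reward, done = env.step(action)     # environment returns S', R
    total_reward += reward                     # cumulative reward to maximize
```

Each pass through the `while` loop is one state → action → reward transition of the RL loop.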
Some key terms that describe the elements of a RL problem are:
- Environment: Physical world in which the agent operates
- State: Current situation of the agent
- Reward: Feedback from the environment
- Policy: Method to map agent’s state to actions
- Value: Expected future reward that an agent would receive by taking an action in a particular state
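To make the "policy" and "value" terms concrete: in the simplest tabular setting, a policy can be just a mapping from states to actions, and values a mapping from states to numbers. A toy sketch (the state names and numbers here are made up for illustration):

```python
# A toy tabular policy: maps each state to an action (hypothetical states).
policy = {
    "start": "right",
    "near_ghost": "left",
    "near_food": "right",
}

# A value table: estimated future reward from each state under this policy
# (numbers invented purely for illustration).
value = {
    "start": 2.0,
    "near_ghost": -5.0,
    "near_food": 8.0,
}

action = policy["near_food"]   # the policy chooses an action for a state
```

Most RL algorithms boil down to improving tables (or function approximators) like these from experience.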
A Reinforcement Learning problem can be best explained through games.
Let’s take the game of PacMan, where the goal of the agent (PacMan) is to eat the food in the grid while avoiding the ghosts on its way. The grid world is the interactive environment for the agent. PacMan receives a reward for eating food and a punishment if it gets killed by a ghost (loses the game). The states are the locations of PacMan in the grid world, and the agent’s goal of maximizing the total cumulative reward corresponds to PacMan winning the game.
Three key concepts of Reinforcement Learning:
Episodic or Continuing tasks
A task is an instance of a Reinforcement Learning problem. We can have two types of tasks: episodic and continuous.
Episodic task: in this case, we have a starting point and an ending point (a terminal state). This creates an episode: a list of states, actions, rewards, and new states.
For instance, think about Super Mario Bros: an episode begins at the launch of a new Mario and ends when you’re killed or you reach the end of the level.
Continuing task: these are tasks that continue forever (no terminal state). Here, the agent has to learn how to choose the best actions while simultaneously interacting with the environment.
For instance, consider an agent that does automated stock trading. For this task, there is no starting point or terminal state: the agent keeps running until we decide to stop it.
Monte Carlo vs TD Learning methods
We have two ways of learning:
- Collecting the rewards at the end of the episode and then calculating the maximum expected future reward: the Monte Carlo approach
- Estimating the rewards at each step: Temporal Difference (TD) Learning
When the episode ends (the agent reaches a “terminal state”), the agent looks at the total cumulative reward to see how well it did. In the Monte Carlo approach, value estimates are only updated at the end of the episode.
Then, we start a new game with the added knowledge. The agent makes better decisions with each iteration.
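A minimal sketch of the Monte Carlo idea, assuming we already have the list of (state, reward) pairs from one finished episode: the return from each step is the discounted sum of all rewards until the end, and value estimates are updated only once the episode terminates. This is the every-visit variant with an incremental mean, chosen here for simplicity:

```python
def monte_carlo_update(values, counts, episode, gamma=0.9):
    """Update value estimates from one finished episode.

    `episode` is a list of (state, reward) pairs collected until the
    terminal state; `values`/`counts` hold running averages per state.
    (Every-visit Monte Carlo — a simple illustrative variant.)
    """
    G = 0.0
    # Walk backwards so G accumulates the discounted return from each step.
    for state, reward in reversed(episode):
        G = reward + gamma * G
        counts[state] = counts.get(state, 0) + 1
        # Incremental mean: V(s) <- V(s) + (G - V(s)) / N(s)
        v = values.get(state, 0.0)
        values[state] = v + (G - v) / counts[state]
    return values

values, counts = {}, {}
episode = [("s0", 0.0), ("s1", 0.0), ("s2", 1.0)]  # reward only at the end
monte_carlo_update(values, counts, episode)
```

Note how the final reward propagates backward: earlier states get smaller (discounted) value estimates, and nothing is learned until the episode is over.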
Temporal Difference Learning: learning at each time step
TD Learning, on the other hand, does not wait until the end of the episode to update its estimate of the maximum expected future reward: it updates its value estimate V for the non-terminal state St encountered at each step of the experience.
This method is called TD(0) or one step TD (update the value function after any individual step).
Exploration/Exploitation trade off
Before looking at the different strategies to solve Reinforcement Learning problems, we must cover one more very important topic: the exploration/exploitation trade-off.
- Exploration is finding more information about the environment.
- Exploitation is exploiting known information to maximize the reward.
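A common way to balance the two is an ε-greedy rule: with probability ε the agent explores (picks a random action), otherwise it exploits the action with the highest estimated value. A sketch, with hypothetical action-value estimates:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick a random action with probability epsilon, else the best-known one."""
    if random.random() < epsilon:
        return random.choice(list(q_values))   # explore: random action
    return max(q_values, key=q_values.get)     # exploit: highest estimated value

q = {"left": 0.2, "right": 1.5, "jump": -0.3}  # made-up value estimates
action = epsilon_greedy(q, epsilon=0.0)        # epsilon=0: pure exploitation
```

In practice ε is often decayed over training: explore a lot early on, then exploit more as the value estimates become reliable.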