Reinforcement Learning is a type of machine learning. It is a goal-oriented learning method in which an agent learns to achieve its goals through trial and error, using feedback from its own actions and experience.
Some key terms that describe the basic elements of an RL problem are:
- Environment — Physical world in which the agent operates
- State — Current situation of the agent
- Reward — Feedback from the environment
- Policy — Method to map agent’s state to actions
- Value — Future reward that an agent would receive by taking an action in a particular state
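The terms above can be tied together in a short sketch of the agent–environment interaction loop. The `LineWorld` environment and the random policy below are hypothetical illustrations, not from the original text: the agent observes a state, the policy picks an action, and the environment returns the next state and a reward.

```python
import random

# Hypothetical toy environment: the agent walks along positions 0..4
# and receives a reward of +1 only when it reaches the goal at 4.
class LineWorld:
    def __init__(self):
        self.state = 0                  # current situation of the agent

    def step(self, action):             # action: -1 (left) or +1 (right)
        self.state = max(0, min(4, self.state + action))
        reward = 1 if self.state == 4 else 0   # feedback from the environment
        done = self.state == 4
        return self.state, reward, done

def policy(state):                      # maps the agent's state to an action
    return random.choice([-1, 1])       # here: purely random, no learning yet

env = LineWorld()
state, total_reward, done = 0, 0, False
while not done:
    action = policy(state)
    state, reward, done = env.step(action)
    total_reward += reward

print(total_reward)   # 1 — the single reward collected on reaching the goal
```

A learning agent would replace the random `policy` with one that is improved from the observed rewards; the loop structure itself stays the same.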
Comparison with other machine learning methodologies
- Supervised vs Reinforcement Learning: In supervised learning, as the name suggests, we have a ‘supervisor’ who has knowledge of the environment and shares it with the agent. But for some problems there are many subtasks that must be performed to achieve the objective.
- In such cases, having a ‘supervisor’ is impractical. For example, in a game of chess there are thousands of moves that can be played at any point in the game, so building a knowledge base becomes intractable. It is more feasible for the agent to learn from its own experience and use that experience to gain knowledge of the environment.
- In both supervised and reinforcement learning there is a mapping between input and output. But reinforcement learning also has a reward system that provides feedback to the agent, which supervised learning lacks.
- Unsupervised vs Reinforcement Learning: In unsupervised learning, unlike reinforcement and supervised learning, there is no mapping between input and output. The task is instead to find patterns without a supervisor.
- For example, categorizing news articles.
Framework for Solving Problems
Markov Decision Process:
The mathematical framework used for deriving a solution to a reinforcement learning task is called a Markov Decision Process (MDP). An MDP formalizes the problem of learning from feedback and interaction to achieve a goal, and all RL problems can be described using MDPs. An MDP is based on the Markov property, which states:
“The future is independent of the past given the present.”
An MDP can be defined by:
- Set of states, S
- Set of actions, A
- Reward function, R
- Policy, π
- Value, V
Here A is the set of actions taken to transition from the start state to the end state in S. In return we get a reward R for each action taken; these rewards can be positive or negative.
The sequence of actions we take defines the policy π, and the rewards we collect in return define the value V. Our aim is to maximize the total reward by choosing the optimal policy.
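As a concrete sketch of these pieces working together, the snippet below runs value iteration on a hypothetical three-state chain MDP (states and rewards are invented for illustration). It repeatedly applies the update V(s) = max over a of [R(s, a, s′) + γ·V(s′)], then reads the greedy policy off the converged values.

```python
# Hypothetical 3-state chain MDP: states 0 -> 1 -> 2, where 2 is terminal.
states = [0, 1, 2]
actions = ["stay", "go"]
gamma = 0.9                       # discount factor for future rewards

def transition(s, a):
    # Deterministic dynamics: "go" advances one state, "stay" remains.
    return min(s + 1, 2) if a == "go" else s

def reward(s, a, s_next):
    # Reward of 1 only for the transition that reaches the terminal state.
    return 1.0 if s_next == 2 and s != 2 else 0.0

# Value iteration: sweep the Bellman optimality update until values settle.
V = {s: 0.0 for s in states}
for _ in range(100):
    V = {s: max(reward(s, a, transition(s, a)) + gamma * V[transition(s, a)]
                for a in actions)
         for s in states}

# The optimal policy is greedy with respect to the converged values.
policy = {s: max(actions,
                 key=lambda a: reward(s, a, transition(s, a))
                               + gamma * V[transition(s, a)])
          for s in states}

print(V)       # state 1 is worth 1.0, state 0 is worth 0.9 (discounted)
print(policy)  # "go" everywhere it helps
```

Note how the value of state 0 is 0.9 rather than 1.0: the reward is one step further away, so it is discounted by γ. Choosing the policy that maximizes these values is exactly the objective described above.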
Thanks for reading! You can also check out our post on:
Introduction To Reinforcement Learning