Reinforcement learning is a pivotal technique in machine learning that trains agents through trial and error to make sequential decisions. By leveraging the reward-and-punishment paradigm, RL algorithms enable machines to learn from their interactions with the environment, optimizing their performance in unseen scenarios. This approach is particularly well-suited for dynamic and complex tasks, as the agent continually refines its strategy based on feedback received after each action, ultimately achieving optimal behavior.
Elements in Reinforcement Learning
Policy: Defines the learning agent’s way of behaving at a given time.
Reward Function: Defines the goal of the problem by mapping each state (or state-action pair) to a single numerical reward.
Value Function: Specifies what is good in the long run, i.e., the total reward an agent can expect to accumulate starting from a given state.
Model of the Environment: Mimics the behavior of the environment and is used for planning. If a state and an action are given, a model can predict the next state and reward.
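To make these four elements concrete, here is a minimal Python sketch for a hypothetical two-state toy problem; the state names, actions, and function names are illustrative assumptions, not part of any standard library.

```python
import random

def policy(state):
    """Policy: maps the agent's current state to an action."""
    return random.choice(["left", "right"])

def reward_fn(state, action, next_state):
    """Reward function: defines the goal by scoring each transition."""
    return 1.0 if next_state == "goal" else 0.0

# Value function: estimated long-run return from each state.
value_table = {"start": 0.0, "goal": 0.0}

def model(state, action):
    """Model of the environment: predicts the next state and reward,
    so a planner can look ahead without acting in the real world."""
    next_state = "goal" if action == "right" else "start"
    return next_state, reward_fn(state, action, next_state)
```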
Types of Reinforcement Learning
Positive: Positive reinforcement occurs when an event, resulting from a particular behavior, increases the strength and frequency of that behavior; a desirable outcome encourages the agent to repeat the action.
Negative: Negative Reinforcement strengthens behavior by stopping or avoiding a negative condition.
Markov Decision Process (MDP)
A Markov Decision Process, or MDP, is used to formalize reinforcement learning problems. If the environment is fully observable, its dynamics can be modelled as a Markov process. In an MDP, the agent continually interacts with the environment by performing actions; after each action, the environment responds and generates a new state.
The learner and decision-maker is called the agent. Everything outside the agent that it interacts with is called the environment. The agent and environment interact at each of a sequence of discrete time steps. At each time step t, the agent receives some representation of the environment's state and, on that basis, selects an action. One time step later, in part as a consequence of its action, the agent receives a numerical reward. At each time step, the agent implements a mapping from states to probabilities of selecting each possible action; this mapping is called the agent's policy. A model gives an action's effect in a state.
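This interaction can be written as a short loop. The environment below is a toy one-dimensional corridor invented purely for illustration; the reset/step interface is an assumption loosely modelled on common RL toolkits and is not specified in the text.

```python
import random

class CorridorEnv:
    """Toy environment: states 0..4; reaching state 4 ends the episode with reward +1."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):                 # action: -1 (left) or +1 (right)
        self.state = max(0, min(4, self.state + action))
        done = (self.state == 4)
        reward = 1.0 if done else 0.0       # R_{t+1}, received one step later
        return self.state, reward, done

env = CorridorEnv()
state = env.reset()                         # S_0
for t in range(20):                         # discrete time steps t = 0, 1, 2, ...
    action = random.choice([-1, 1])         # placeholder random policy
    next_state, reward, done = env.step(action)
    state = next_state
    if done:
        break
```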
-> Policy: In a reinforcement learning setting, the policy is typically learned by interacting with the environment and receiving feedback in the form of rewards. Algorithms like Q-learning, Policy Gradients, or Deep Q-Networks (DQN) can be used to find the optimal policy π* that maximizes the cumulative reward over time. It can be represented as a table (for discrete states) or a function approximator.
-> S is the set of possible states
-> A(St) is the set of actions available in state St
-> πt is the policy
-> πt(a|s) is the probability that At = a if St = s
-> Rt+1 ∈ ℛ ⊂ ℝ is the numerical reward received one time step later
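As a small illustration of the notation above, the sketch below stores a stochastic policy πt(a|s) as a table of probabilities and samples an action from it; the particular states, actions, and probabilities are made-up values for demonstration.

```python
import random

states = ["s0", "s1"]
actions = ["a0", "a1"]

# pi[s][a] = probability that A_t = a given S_t = s
pi = {
    "s0": {"a0": 0.8, "a1": 0.2},
    "s1": {"a0": 0.5, "a1": 0.5},
}

def sample_action(state):
    """Draw an action according to pi_t(a | state)."""
    probs = pi[state]
    return random.choices(list(probs.keys()), weights=list(probs.values()))[0]

print(sample_action("s0"))  # usually "a0", occasionally "a1"
```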
Explanation of MDP with an example:
Problem Statement: Reducing wait time at traffic intersection
Constraints:
-> maximize the number of cars passing through the intersection without stopping
-> 2-way intersection: North and East
-> Traffic light has only 2 colours: red and green
-> Each time step represents a few seconds; at each time step we decide whether or not to change the light colour
State: represented as a combination of
1. The color of the traffic light (red, green) in each direction.
2. Duration of the traffic light in the same color.
3. The number of cars approaching the intersection in each direction.
Actions: Change light colour, do not change light colour.
Reward = (number of cars expected to pass in the next time step) − α * exp(β * duration for which the traffic light has been red in the other direction).
α is a scaling factor that determines the overall weight or importance of the penalty for having a red light in one direction while allowing traffic in the other direction.
β controls the exponential growth rate of the penalty based on the duration for which the traffic light has been red.
Policy, denoted as π, maps states to actions. The objective of the policy is to maximize the cumulative reward over time, which in this case means maximizing the number of cars passing through the intersection while minimizing the penalty for blocking traffic in the other direction.
Example state representations, actions, reward and policy:
State S = (TN, TE, tN, tE, nN, nE)
TN: Traffic light color at North.
TE: Traffic light color at East.
tN: Duration for which the light has been red or green in the North.
tE: Duration for which the light has been red or green in the East.
nN: Number of cars approaching the intersection from the North.
nE: Number of cars approaching the intersection from the East.
Actions A = (aN, aE)
-> aN, action taken in North direction (change colour= 1, do not change= 0)
-> aE, action taken in East direction (change colour= 1, do not change= 0)
Reward R(S, A) = (nN + nE) − α⋅(e^(β⋅tN) + e^(β⋅tE))
Policy, π: S→A
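Putting the example together, the sketch below encodes the state tuple, the action tuple, and the reward formula above in Python; the numeric values of α, β, and the sample state are illustrative assumptions. Note that, as written, the reward depends only on the components of the state.

```python
import math

ALPHA = 0.1   # weight of the penalty for keeping a direction red
BETA = 0.2    # exponential growth rate of that penalty

# State S = (TN, TE, tN, tE, nN, nE)
state = ("green", "red", 12, 12, 5, 3)

# Action A = (aN, aE): 1 = change colour, 0 = do not change
action = (0, 0)

def reward(state):
    """R(S, A) = (nN + nE) - alpha * (e^(beta*tN) + e^(beta*tE))."""
    TN, TE, tN, tE, nN, nE = state
    return (nN + nE) - ALPHA * (math.exp(BETA * tN) + math.exp(BETA * tE))

print(round(reward(state), 2))  # reward for the sample state above
```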
Reinforcement Learning Algorithms
Model-Based vs. Model-Free:
Model-Based RL: Builds an internal model of the environment to simulate and plan actions. Suitable for well-defined environments. Examples: Monte Carlo Tree Search (MCTS), Dynamic Programming (DP).
Model-Free RL: No internal model is used; the agent learns through trial and error. Suitable for large and complex environments. Examples: Q-Learning, SARSA.
Value-Based vs. Policy-Based:
Value-Based Methods: Learn a value function (for example, the action-value function Q) and derive the policy by acting greedily with respect to it, iteratively improving the estimates until convergence. Examples: Q-Learning, SARSA.
Policy-Based Methods: Directly optimize the policy itself, typically by adjusting its parameters in the direction that increases expected cumulative reward. Examples: REINFORCE, TRPO.
On-Policy vs. Off-Policy:
On-Policy: The agent learns the value of the current behavior policy. Examples: Actor-Critic (A2C), REINFORCE.
Off-Policy: The agent learns about a policy different from the one it uses to generate behavior. Examples: Q-Learning, DDPG.
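A minimal sketch of the distinction, assuming a tabular setting: Q-learning (off-policy, value-based, model-free) bootstraps from the greedy action in the next state, while SARSA (on-policy) bootstraps from the action its own behaviour policy actually takes. The action set and hyperparameter values are illustrative assumptions.

```python
import random
from collections import defaultdict

LEARNING_RATE, GAMMA, EPSILON = 0.1, 0.9, 0.1
ACTIONS = [-1, 1]

Q = defaultdict(float)   # Q[(state, action)] -> estimated action value

def epsilon_greedy(state):
    """Behaviour policy: mostly greedy, occasionally random."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def q_learning_update(s, a, r, s_next):
    # Off-policy: bootstrap from the best action in s_next,
    # regardless of which action the behaviour policy will actually take.
    best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] += LEARNING_RATE * (r + GAMMA * best_next - Q[(s, a)])

def sarsa_update(s, a, r, s_next, a_next):
    # On-policy: bootstrap from the action the current policy actually selected.
    Q[(s, a)] += LEARNING_RATE * (r + GAMMA * Q[(s_next, a_next)] - Q[(s, a)])
```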
Applications of Reinforcement Learning:
Marketing: Customized suggestions for users based on their interactions.
Control Systems: Optimizing elevator dispatching.
Game Playing: Games like chess and tic-tac-toe.
Finance: Optimizing long-term returns while considering transaction costs.
Challenges/ Disadvantages of Reinforcement Learning:
1. Experimenting with real-world reward and punishment systems may not be practical.
2. Learned behavior can be difficult to debug and interpret.
3. Needs a lot of data and a lot of computation.
4. Highly dependent on the quality of the reward function. If the reward function is poorly designed, the agent may not learn the desired behavior.
Reinforcement Learning is an intriguing and valuable subset of Machine Learning. It enables agents to learn optimal behaviors through interaction with their environment, leveraging trial and error to improve performance over time. While RL presents challenges, its potential applications make it a powerful tool for solving complex, dynamic problems.