**Reinforcement learning** is a pivotal technique in machine learning that trains agents through trial and error to make sequential decisions. By leveraging the reward-and-punishment paradigm, RL algorithms enable machines to learn from their interactions with the environment, optimizing their performance in unseen scenarios. This approach is particularly well-suited for dynamic and complex tasks, as the agent continually refines its strategy based on feedback received after each action, ultimately achieving optimal behavior.

**Elements** of Reinforcement Learning:

**1. Policy**: defines the learning agent's way of behaving at a given time.

**2. Reward function**: used to define a goal in a reinforcement learning problem.

**3. Value function**: specifies what is good in the long run.

**4. Model of the environment**: mimics the behavior of the environment and is used for planning; given a state and an action, the model can predict the next state and reward.

**Types** of Reinforcement Learning:

**1. Positive**: positive reinforcement occurs when an event, produced by a particular behaviour, increases the strength and frequency of that behaviour. In other words, it has a positive effect on behaviour.

**2. Negative**: negative reinforcement is the strengthening of a behaviour because a negative condition is stopped or avoided.

**Markov Decision Process:**

A Markov Decision Process, or MDP, is used to formalize reinforcement learning problems. If the environment is completely observable, its dynamics can be modelled as a Markov process. In an MDP, the agent constantly interacts with the environment and performs actions; after each action, the environment responds and generates a new state.

The learner and decision-maker is called the agent. Everything outside the agent that it interacts with is called the environment. The agent and environment interact at each of a sequence of discrete time steps. At each time step t, the agent receives some representation of the environment's state and on that basis selects an action. One time step later, in part as a consequence of its action, the agent receives a numerical reward. At each time step, the agent implements a mapping from states to probabilities of selecting each possible action; this mapping is called the agent's policy. A model gives an action's effect in a state.
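This interaction loop can be sketched directly in code. The toy environment dynamics and the random policy below are hypothetical stand-ins invented for illustration, not part of any particular RL library:

```python
import random

def env_step(state, action):
    """Hypothetical environment: responds to an action with (next_state, reward)."""
    next_state = (state + action) % 5          # toy deterministic dynamics
    reward = 1.0 if next_state == 0 else 0.0   # numerical reward R_{t+1}
    return next_state, reward

def policy(state):
    """Hypothetical policy: selects an action given the current state."""
    return random.choice([1, 2])

state = 0
total_reward = 0.0
for t in range(10):                            # a sequence of discrete time steps
    action = policy(state)                     # agent acts on the observed state S_t
    state, reward = env_step(state, action)    # environment returns S_{t+1}, R_{t+1}
    total_reward += reward                     # agent accumulates reward over time
```

The essential shape is always the same: observe a state, select an action, receive a reward and the next state, repeat.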

**-> Policy**: In a reinforcement learning setting, the policy is typically learned by interacting with the environment and receiving feedback in the form of rewards. Algorithms such as Q-learning, policy gradients, or Deep Q-Networks (DQN) can be used to find the optimal policy π* that maximizes the cumulative reward over time. The policy can be represented as a table (for discrete states) or as a function approximator.
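As a concrete sketch of the tabular case, here is Q-learning on a hypothetical five-state corridor where reaching the rightmost state pays reward 1. The environment, hyperparameters, and episode count are all invented for illustration:

```python
import random
from collections import defaultdict

alpha, gamma, eps = 0.5, 0.9, 0.1   # assumed learning rate, discount, exploration
Q = defaultdict(float)               # the learned table: Q[(state, action)]
ACTIONS = (-1, 1)                    # move left or right

def env_step(s, a):
    """Hypothetical corridor: states 0..4, reward 1.0 for reaching state 4."""
    s2 = max(0, min(4, s + a))
    return s2, (1.0 if s2 == 4 else 0.0), s2 == 4

def greedy(s):
    """Greedy action under the current Q-table, ties broken at random."""
    best = max(Q[(s, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[(s, a)] == best])

random.seed(0)
for episode in range(200):
    s = 0
    for _ in range(100):                     # cap episode length
        a = random.choice(ACTIONS) if random.random() < eps else greedy(s)
        s2, r, done = env_step(s, a)
        # Q-learning update: bootstrap from the best action in the next state
        target = r + gamma * max(Q[(s2, a2)] for a2 in ACTIONS)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2
        if done:
            break
```

After training, the greedy policy read off the table moves right toward the rewarding state.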

**->** S is the set of possible states

**->** A(St) is the set of actions available in state St

**->** πt is the policy

**->** πt(a|s) is the probability that At = a if St = s

**->** Rt+1 ∈ ℛ ⊂ ℝ is the numerical reward
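For discrete states, the mapping πt(a|s) is just a table of probabilities, one distribution over actions per state. A minimal illustration with invented states and actions:

```python
# Tabular stochastic policy: pi[s][a] is the probability that A_t = a given S_t = s.
# The states, actions, and probabilities below are invented for illustration.
pi = {
    "s0": {"left": 0.8, "right": 0.2},
    "s1": {"left": 0.5, "right": 0.5},
}

def action_prob(s, a):
    """Look up pi(a|s) in the table."""
    return pi[s][a]

# each row must be a valid probability distribution over the available actions
for s, dist in pi.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```

A learned policy replaces these hand-written numbers with values improved from experience; with continuous or very large state spaces the table is replaced by a function approximator.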

**Explanation of MDP with an example:**

Problem Statement: **Reducing wait time at a traffic intersection**

**Constraints**:

**->** Maximize the number of cars passing through the intersection without stopping

**->** Two-way intersection: North and East

**->** The traffic light has only two colors: red and green

**->** Each time step represents a few seconds; at each time step we decide whether or not to change the light color

**State**: represented as a combination of

**1.** The color of the traffic light (red, green) in each direction.

**2.** The duration of the traffic light in the same color.

**3.** The number of cars approaching the intersection in each direction.

**Actions**: Change light colour, do not change light colour.

**Reward** = (number of cars expected to pass in the next time step) − α · exp(β · duration for which the traffic light has been red in the other direction)

α is a scaling factor that determines the overall weight or importance of the penalty for having a red light in one direction while allowing traffic in the other direction.

β controls the exponential growth rate of the penalty based on the duration for which the traffic light has been red.

**Policy**, denoted as **π**, maps states to actions: the objective of the policy is to maximize the cumulative reward over time, which in this case is the number of cars passing through the intersection while minimizing the penalty for blocking traffic in the other direction.

**State transitions**: Deterministic

*(Figure: three of the possible states shown.)*

Example state representations, actions, reward and policy:

**States**: S = (T_N, T_E, t_N, t_E, n_N, n_E)

**->** T_N: traffic light color at North

**->** T_E: traffic light color at East

**->** t_N: duration for which the light has been red or green at North

**->** t_E: duration for which the light has been red or green at East

**->** n_N: number of cars approaching the intersection from the North

**->** n_E: number of cars approaching the intersection from the East

**Actions**: A = (a_N, a_E)

**->** a_N: action taken in the North direction (change colour = 1, do not change = 0)

**->** a_E: action taken in the East direction (change colour = 1, do not change = 0)

**Reward**: R(S, A) = (n_N + n_E) − α · (e^{β·t_N} + e^{β·t_E})

**Policy**, π: S→A
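The reward function above is straightforward to compute. A sketch follows, with assumed values for α and β, since the text does not fix the constants:

```python
import math

ALPHA = 0.1   # assumed scaling factor for the red-light penalty
BETA = 0.05   # assumed exponential growth rate of the penalty

def reward(n_N, n_E, t_N, t_E):
    """R(S, A) = (n_N + n_E) - alpha * (e^(beta*t_N) + e^(beta*t_E))."""
    return (n_N + n_E) - ALPHA * (math.exp(BETA * t_N) + math.exp(BETA * t_E))
```

With freshly changed lights (t_N = t_E = 0) the penalty term is just 2α, so five approaching cars yield a reward of 4.8 under these constants; the penalty then grows exponentially the longer either light stays red.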

**Reinforcement Learning Algorithms**:

There are various algorithms used in reinforcement learning (RL), such as Q-learning, policy gradient methods, and temporal-difference learning. These can be classified on different bases.

On the basis of **use of model** of environment:

**1. Model-based** RL: used when the environment is well defined and fixed; the agent builds an internal representation (model) of the environment. Example: Monte Carlo Tree Search (as used in AlphaGo).

**2. Model-free** RL: used in environments that are large, complex, and hard to describe; the agent uses a trial-and-error approach within the environment. Examples: Q-learning, policy gradient methods.

On the basis of **type of function**:

**1. Value-Based**: the method initializes the state-value function (V) with random values and then iteratively improves its estimate until convergence. Examples: Q-learning, SARSA.

**2. Policy-Based**: the method works on the policy directly; policy evaluation assigns a new value to each state under the current policy, and policy improvement then tries to improve the policy. The algorithm keeps alternating between the two phases until the optimal value of each state is found. Examples: policy iteration, policy gradient methods such as REINFORCE.
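The value-based idea can be illustrated with value iteration on a hypothetical two-state MDP; the states, actions, rewards, and discount factor below are invented for illustration:

```python
# Deterministic dynamics: P[(state, action)] = (next_state, reward)
P = {
    ("A", "stay"): ("A", 0.0), ("A", "go"): ("B", 1.0),
    ("B", "stay"): ("B", 2.0), ("B", "go"): ("A", 0.0),
}
gamma = 0.5
V = {"A": 0.0, "B": 0.0}   # initialize state values, then iterate to convergence

for _ in range(50):
    # Bellman optimality backup: take the best action from each state
    V = {s: max(r + gamma * V[s2]
                for (s0, a), (s2, r) in P.items() if s0 == s)
         for s in V}
```

Each sweep contracts the error by the discount factor, so after 50 iterations V has effectively converged: staying in B forever gives 2/(1 − 0.5) = 4, and going A → B gives 1 + 0.5 · 4 = 3.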

On the basis of **policy**:

The behaviour policy is how an agent acts in any given state; the update policy is how the agent imagines it will act while calculating the value of a state-action pair.

**1. On-Policy**: the value of a state-action pair is calculated using the current behaviour policy, i.e. the behaviour policy and the update policy are the same. Examples: SARSA, actor-critic methods.

**2. Off-Policy**: the behaviour policy and the update policy are different. Example: Q-learning.
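The distinction shows up directly in the update target. A sketch for a single transition (s, a, r, s2), with an invented Q-table and an exploratory action chosen by the behaviour policy:

```python
# Hypothetical Q-values for the two actions available in the next state s2
Q = {("s2", 0): 1.0, ("s2", 1): 2.0}
gamma = 0.9
r = 1.0
a2 = 0   # action the behaviour policy actually takes in s2 (exploratory, not greedy)

# On-policy (SARSA): the target evaluates the action the behaviour policy takes
sarsa_target = r + gamma * Q[("s2", a2)]

# Off-policy (Q-learning): the target evaluates the greedy action regardless
q_target = r + gamma * max(Q[("s2", 0)], Q[("s2", 1)])
```

Here SARSA's target is 1.9 while Q-learning's is 2.8: Q-learning updates toward the greedy action's value even when the behaviour policy explored, which is exactly what makes it off-policy.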

**Applications** of Reinforcement Learning:

**1. Robotics**: robot navigation, Robo-soccer, walking, juggling

**2. Marketing**: customize suggestions to individual users based on their interactions

**3. Control Systems**: elevator dispatching

**4. Game Playing**: Tic-tac-toe, chess

**5. Finance**: optimize long-term returns by considering transaction costs and adapting to market shifts

**Challenges/Disadvantages** of Reinforcement Learning:

**1.** Experimenting with real-world reward and punishment systems may not be practical.

**2.** Difficult to deduce, debug, and interpret.

**3.** Needs a lot of data and a lot of computation.

**4.** Highly dependent on the quality of the reward function; if the reward function is poorly designed, the agent may not learn the desired behavior.

Reinforcement Learning is an intriguing and valuable subset of Machine Learning. It enables agents to learn optimal behaviors through interaction with their environment, leveraging trial and error to improve performance over time. While RL presents challenges, its potential applications make it a powerful tool for solving complex, dynamic problems.