Q-Learning

 

 

Q-learning is a model-free, value-based, off-policy reinforcement learning algorithm that learns the value of each action in each state, allowing the agent to find the best sequence of actions from its current state.

Components of Q-Learning:

         1. State: the current position or situation of the agent in the environment.

         2. Action: the operation performed by the agent in a particular state.

         3. Reward: the positive or negative feedback the agent receives for an action.

         4. Episode: one complete run of the agent through the environment, ending when a terminal state is reached.

         5. Q-Values: estimates of how good an action is in a particular state, i.e. the expected cumulative reward.

         6. Q-Table: a table with one row per state and one column per action, storing the Q-value of every state-action pair in a specific environment.

         7. Q-Function: the function Q(s, a) that maps a state s and an action a to a Q-value; it is updated using the Bellman equation.

 

Bellman Equation:
A recursive formula for optimal decision-making. In the Q-learning context, the Bellman equation is used to compute the value of taking an action in a given state; the action with the highest Q-value in a state is considered the best choice for that state. The resulting update rule is:
Q(s, a) = Q(s, a) + α * (r + γ * max(Q(s', a')) – Q(s, a))

         -> Q(s, a) represents the expected cumulative reward for taking action a in state s.

         -> The actual reward received for that action is r, while s' refers to the next state.

         -> The learning rate is α and γ is the discount factor.

         -> The highest expected reward over all possible actions a' in state s' is represented by max(Q(s', a')). A minimal code sketch of this update follows.
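
As a concrete illustration, the update rule can be written as a small Python function; the names q_table, alpha, and gamma are illustrative choices, not something the text prescribes, and the Q-table is assumed to be a 2-D array indexed by (state, action).

import numpy as np

def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update to q_table in place."""
    # Highest expected reward over all actions a' in the next state s'
    best_next = np.max(q_table[next_state])
    # TD target: observed reward plus the discounted value of the best next action
    td_target = reward + gamma * best_next
    # Move Q(s, a) toward the target by the learning rate alpha
    q_table[state, action] += alpha * (td_target - q_table[state, action])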

 

Temporal difference:
The temporal-difference (TD) error is the difference between the current estimate Q(s, a) and the TD target, r + γ * max(Q(s', a')), i.e. the observed reward plus the discounted value of the best action in the next state. The Q-value is then moved toward the target in proportion to the learning rate.
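
For example, the target and error can be computed with concrete numbers; every value below (reward, discount, learning rate, current estimates) is made up purely for illustration.

reward, gamma, alpha = 1.0, 0.9, 0.1
q_current = 0.5                            # current estimate Q(s, a)
q_next_best = 2.0                          # max over a' of Q(s', a')

td_target = reward + gamma * q_next_best   # 1.0 + 0.9 * 2.0 = 2.8
td_error = td_target - q_current           # 2.8 - 0.5 = 2.3
q_updated = q_current + alpha * td_error   # 0.5 + 0.1 * 2.3 = 0.73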

 

Epsilon Greedy Strategy:
At the beginning, epsilon is set high: the agent explores the environment and chooses actions at random, because it does not yet know anything about the environment. As the agent explores, the epsilon rate is decreased and the agent increasingly exploits what it has learned.

         1. Generate a random number between 0 and 1.

         2. If the random number is greater than epsilon, exploit: the agent takes the action with the highest Q-value for the current state.

         3. Otherwise, explore: the agent takes a random action (a code sketch follows this list).
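
A minimal sketch of this selection rule, plus a simple epsilon-decay schedule; the decay parameters are assumptions, not values from the text.

import random
import numpy as np

def epsilon_greedy_action(q_table, state, epsilon, n_actions):
    """Explore with probability epsilon, otherwise exploit the best known action."""
    if random.random() > epsilon:
        # Exploit: action with the highest Q-value for this state
        return int(np.argmax(q_table[state]))
    # Explore: pick any action uniformly at random
    return random.randrange(n_actions)

def decay_epsilon(epsilon, min_epsilon=0.05, decay=0.995):
    """Shrink epsilon (e.g. after every episode) so exploitation takes over."""
    return max(min_epsilon, epsilon * decay)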

 

Pseudo-code for Q-learning in an episodic task using a Q-table

         1. Initialize the Q-table with zero values.

         2. An episode begins.

         3. Perform action a_t from state s_t and observe the next state s_t+1 and reward r.

         4. Compute the new Q-value using the update equation above and update the Q-table.

         5. Set s_t+1 as the new s_t and repeat steps 3 to 4 until s_t+1 is a terminal state.

         6. The episode ends.

         7. Repeat steps 2 to 6 until the Q-values converge (a code sketch of this loop follows).
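
Putting the steps together, a sketch of the episodic training loop might look like the following. It assumes a hypothetical environment object whose reset() returns an initial state index and whose step(action) returns (next_state, reward, done); all hyperparameter values are illustrative.

import numpy as np

def train(env, n_states, n_actions, episodes=1000,
          alpha=0.1, gamma=0.9, epsilon=1.0,
          min_epsilon=0.05, decay=0.995):
    """Tabular Q-learning over a hypothetical env with reset()/step()."""
    q_table = np.zeros((n_states, n_actions))        # Step 1: all zeros

    for _ in range(episodes):                        # Step 2: episode begins
        state = env.reset()                          # assumed API
        done = False
        while not done:                              # Steps 3 to 5
            # Epsilon-greedy choice between exploration and exploitation
            if np.random.rand() > epsilon:
                action = int(np.argmax(q_table[state]))
            else:
                action = np.random.randint(n_actions)

            next_state, reward, done = env.step(action)   # assumed API

            # Step 4: move Q(s, a) toward the TD target
            best_next = np.max(q_table[next_state])
            q_table[state, action] += alpha * (
                reward + gamma * best_next - q_table[state, action])

            state = next_state                       # Step 5: advance the state
        epsilon = max(min_epsilon, epsilon * decay)  # explore less over time

    return q_table                                   # Steps 6 to 7: repeat per episode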

 


Explanation with an example: Robot example

Step 1: Initialize the Q-table

        -> n = number of columns = number of actions

        -> m = number of rows = number of states

        -> Initialize all values to 0 (a minimal sketch follows this list)
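
A minimal sketch of this initialization, with example sizes that are assumed for illustration:

import numpy as np

m, n = 6, 4                   # assumed: 6 states, 4 actions
q_table = np.zeros((m, n))    # m rows (states) x n columns (actions), all zeros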

Step 2: The episode starts. Choose an action using the epsilon-greedy strategy. The agent passes this action to the environment to execute and receives feedback in the form of a reward.

Step 3: Perform action

Step 4: An action has been taken and we have an outcome and a reward. The function Q(s, a) must now be evaluated and the Q-table updated. In this example, the reward structure is:

        -> power = +1

        -> mine = -100

        -> end = +100

The more iterations we run, the more accurate the Q-values become. In this way, we explore the environment and update the Q-table. Once the Q-table is ready, the agent starts to exploit the environment. A single worked update under this reward structure is sketched below.
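
As an illustration, suppose the robot steps onto a mine (r = -100) early in training; the learning rate, discount factor, and starting Q-values below are assumed for the example.

alpha, gamma = 0.1, 0.9      # assumed hyperparameters
q_sa = 0.0                   # current Q(s, a) in a freshly initialized table
q_next_best = 0.0            # best Q-value in the next state, also still 0
reward = -100                # the robot stepped on a mine

q_sa += alpha * (reward + gamma * q_next_best - q_sa)
print(q_sa)                  # -10.0: the mine's penalty starts propagating back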

 

Q-learning uses two different actions in each time-step:

         1. Current action: the action from the current state that is actually executed in the environment, and whose Q-value is updated.

         2. Target action: the action with the highest Q-value in the next state; it is used only to update the current action's Q-value and need not ever be executed, which is what makes Q-learning off-policy (see the sketch below).
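
In code, the two actions appear as two separate selections; a self-contained sketch with assumed variable names and values:

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 6, 4
q_table = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.2
state, next_state, reward = 0, 1, 1.0     # illustrative transition

# Current action: chosen epsilon-greedily and actually executed in the environment
if rng.random() > epsilon:
    current_action = int(np.argmax(q_table[state]))
else:
    current_action = int(rng.integers(n_actions))

# Target action: the greedy action in the next state, used only inside the update
target_action = int(np.argmax(q_table[next_state]))
td_target = reward + gamma * q_table[next_state, target_action]
q_table[state, current_action] += alpha * (td_target - q_table[state, current_action])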

 

Advantages:

        -> It optimizes for long-term cumulative reward rather than only the immediate reward.

        -> It can correct mistakes made during training, since Q-value estimates are continually updated.

 

Disadvantages:
Q-learning with a Q-table only works for discrete, countable actions and states. Discretizing continuous actions and states produces an enormous number of state-action combinations, which makes learning slow and memory-inefficient.

 

Variants:

        -> Deep Q-Learning

        -> Double Q-Learning

        -> Multi-agent learning

        -> Delayed Q-Learning

 

Applications:

        -> Games

        -> Automation: Robot control

        -> Traffic management: Self-driving cars

        -> Space travel: Satellite control

        -> Algorithmic trading

        -> Network Resource Allocation: Allocating bandwidth in communication networks

Q-learning is a reinforcement learning algorithm that helps agents learn optimal actions from their environment. It is best suited to discrete actions and states but can be inefficient for continuous spaces; variants like Deep Q-Learning address this. Its applications range from robotics to self-driving cars and gaming, highlighting its versatility and importance.
