Gated Recurrent Unit

M K Sumana


In this article we will discuss Gated Recurrent Units (GRUs), which are simpler alternatives to LSTMs. We will cover their architecture, how they work, how they compare with LSTMs, their pros and cons, and their applications. Recurrent Neural Networks (RNNs) have emerged as a powerful class of deep learning models for processing sequential data. However, RNNs struggle with long-term dependencies within sequences. This is where GRUs come in: they address this limitation by using gating mechanisms to control the flow of information, making them a valuable tool for a variety of machine learning tasks.[i] The GRU is a type of RNN introduced by Cho et al. in 2014 as a simpler alternative to Long Short-Term Memory (LSTM) networks. Like the LSTM, the GRU can process sequential data such as text, speech, and time series. The basic idea behind the GRU is to use gating mechanisms to selectively update the hidden state of the network at each time step.[ii]

GRU Architecture

The GRU architecture consists of the following components:

  1. Input layer: The input layer takes in sequential data, such as a sequence of words or a time series of values, and feeds it into the GRU.

  2. Hidden layer: The hidden layer is where the recurrent computation occurs. At each time step, the hidden state is updated based on the current input and the previous hidden state. The hidden state is a vector of numbers that represents the network’s “memory” of the previous inputs.

  3. Reset gate: The reset gate determines how much of the previous hidden state to forget. It takes as input the previous hidden state and the current input, and produces a vector of numbers between 0 and 1 that controls the degree to which the previous hidden state is “reset” at the current time step.

  4. Update gate: The update gate determines how much of the candidate activation vector to incorporate into the new hidden state. It takes as input the previous hidden state and the current input, and produces a vector of numbers between 0 and 1 that controls the degree to which the candidate activation vector is incorporated into the new hidden state.

  5. Candidate activation vector: The candidate activation vector is a modified version of the previous hidden state that is “reset” by the reset gate and combined with the current input. It is computed using a tanh activation function that squashes its output between -1 and 1.

  6. Output layer: The output layer takes the final hidden state as input and produces the network’s output. This could be a single number, a sequence of numbers, or a probability distribution over classes, depending on the task at hand. [iii]
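
To make these components concrete, here is a minimal sketch of a GRU-based sequence classifier in PyTorch (the framework choice and all names such as GRUClassifier, vocab_size and hidden_dim are our own illustrative assumptions, not a reference implementation): the embedding acts as the input layer, nn.GRU implements the recurrent hidden layer with its reset and update gates, and a linear layer maps the final hidden state to the output.

# Minimal GRU-based sequence classifier (PyTorch sketch; all sizes are illustrative).
import torch
import torch.nn as nn

class GRUClassifier(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=128, num_classes=5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)        # input layer: token ids -> vectors
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)  # hidden layer with reset/update gates
        self.output = nn.Linear(hidden_dim, num_classes)            # output layer: last hidden state -> scores

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)     # (batch, seq_len, embed_dim)
        _, h_final = self.gru(embedded)          # h_final: (1, batch, hidden_dim)
        return self.output(h_final.squeeze(0))   # (batch, num_classes)

# Usage: a batch of 2 sequences, each 10 tokens long.
model = GRUClassifier()
logits = model(torch.randint(0, 1000, (2, 10)))
print(logits.shape)  # torch.Size([2, 5])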

Working of GRUs:

  1. Calculate the update gate z_t for time step t using the formula:

    z_t = σ(W(z) x_t + U(z) h_(t-1))

    When x_t is plugged into the network unit, it is multiplied by its own weight W(z). The same goes for h_(t-1), which holds the information for the previous t-1 units and is multiplied by its own weight U(z). Both results are added together and a sigmoid activation function is applied to squash the result between 0 and 1.

  2. Calculate the reset gate r_t. As before, we plug in h_(t-1) and x_t, multiply them by their corresponding weights W(r) and U(r), sum the results and apply the sigmoid function:

    r_t = σ(W(r) x_t + U(r) h_(t-1))

  3. Calculate the current memory content h'_t, which uses the reset gate to store the relevant information from the past. Do an element-wise multiplication (⊙) of r_t and U h_(t-1), sum the result with the weighted input W x_t, and apply tanh to squash the output between -1 and 1:

    h'_t = tanh(W x_t + r_t ⊙ U h_(t-1))

  4. Finally, calculate h_t, the vector that holds information for the current unit and passes it down the network. The update gate z_t determines what to collect from the current memory content h'_t and what to keep from the previous steps h_(t-1):

    h_t = z_t ⊙ h_(t-1) + (1 - z_t) ⊙ h'_t   [iv]
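
The four steps above can be collected into a single function. Below is a plain-NumPy sketch of one GRU time step that follows the same formulas; the weight matrices are randomly initialized purely for illustration, and biases are omitted to match the equations as written.

# One GRU time step written directly from the four formulas above (NumPy sketch).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, U_z, W_r, U_r, W, U):
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev)         # step 1: update gate
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev)         # step 2: reset gate
    h_cand = np.tanh(W @ x_t + r_t * (U @ h_prev))  # step 3: current memory content h'_t
    return z_t * h_prev + (1.0 - z_t) * h_cand      # step 4: new hidden state h_t

# Toy example: input dimension 3, hidden dimension 4; weights are random for illustration only.
rng = np.random.default_rng(0)
x_t, h_prev = rng.normal(size=3), np.zeros(4)
W_z, W_r, W = (rng.normal(size=(4, 3)) for _ in range(3))
U_z, U_r, U = (rng.normal(size=(4, 4)) for _ in range(3))
print(gru_step(x_t, h_prev, W_z, U_z, W_r, U_r, W, U))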

Comparison of GRUs and LSTMs:

Primarily, GRUs have two gates compared to the three gates in LSTM cells. A notable aspect of GRU networks is that, unlike LSTMs, they do not keep a separate cell state (c_t); GRUs only maintain a hidden state (h_t). This simpler architecture allows GRUs to train faster. In a GRU, a single update gate manages both how much of the historical information h_(t-1) to keep and how much of the new candidate content to add, whereas an LSTM uses separate forget and input gates for these two roles. [v]
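
A quick way to see the effect of having two gates and no separate cell state is to compare parameter counts of equally sized layers. The snippet below (PyTorch, used here only as an illustration) shows the roughly 3:4 ratio that follows from the GRU keeping three weight sets (reset, update, candidate) versus the LSTM's four (input, forget, output, candidate).

# Parameter counts of equally sized GRU and LSTM layers (PyTorch sketch).
import torch.nn as nn

gru = nn.GRU(input_size=128, hidden_size=256, batch_first=True)
lstm = nn.LSTM(input_size=128, hidden_size=256, batch_first=True)

gru_params = sum(p.numel() for p in gru.parameters())    # 3 weight sets: reset, update, candidate
lstm_params = sum(p.numel() for p in lstm.parameters())  # 4 weight sets: input, forget, output, candidate
print(gru_params, lstm_params, round(gru_params / lstm_params, 2))  # 296448 395264 0.75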


Applications of GRUs in Real-World Scenarios:

  1. In speech recognition systems, GRUs are employed for tasks like speech-to-text conversion, phoneme recognition, and speaker identification.
  2. GRUs are also utilized in time series prediction tasks, including financial forecasting, stock market analysis, and weather prediction.
  3. Their ability to capture temporal dependencies and handle sequential data makes GRUs suitable for applications in video analysis, gesture recognition, and action recognition.
  4. In healthcare, GRUs are used for patient monitoring, disease prediction, and medical image analysis, leveraging sequential patient data for diagnosis and treatment planning. [vi]
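
As a small, self-contained illustration of the time-series use case above, the following sketch trains a GRU to predict the next value of a noisy sine wave from a sliding window of past values; the data, model size and training settings are arbitrary choices for demonstration rather than a recommended setup.

# Toy one-step-ahead forecasting of a noisy sine wave with a GRU (PyTorch sketch).
import math
import torch
import torch.nn as nn

# Sliding windows: 20 past values -> the next value.
series = torch.sin(torch.linspace(0, 20 * math.pi, 1000)) + 0.05 * torch.randn(1000)
window = 20
X = torch.stack([series[i:i + window] for i in range(len(series) - window)]).unsqueeze(-1)
y = series[window:].unsqueeze(-1)

class GRUForecaster(nn.Module):
    def __init__(self, hidden_dim=32):
        super().__init__()
        self.gru = nn.GRU(1, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        _, h = self.gru(x)              # last hidden state summarises the window
        return self.head(h.squeeze(0))  # predicted next value

model = GRUForecaster()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(50):                 # full-batch training, enough for a toy example
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()
print(f"final training MSE: {loss.item():.4f}")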


Advantages:

  1. Faster training and greater computational efficiency than LSTMs, owing to their fewer parameters.
  2. Effective for sequential tasks: Their gating mechanisms allow them to selectively remember or forget information, leading to better performance on tasks like machine translation or forecasting.
  3. Less Prone to Gradient Problems: The gating mechanisms in GRUs help mitigate the vanishing/exploding gradient problems that plague standard RNNs. [vii]

Disadvantages:

  1. May be more prone to overfitting than LSTMs, especially on smaller datasets.
  2. Their simpler gating mechanism can limit their ability to capture very complex relationships or long-term dependencies in certain scenarios.
  3. GRU networks require careful tuning of hyperparameters, such as the number of hidden units and learning rate, to achieve good performance.
  4. Not as interpretable as other machine learning models due to the gating mechanism. [viii] [ix]

Gated Recurrent Units (GRUs) offer a streamlined and efficient alternative to Long Short-Term Memory (LSTM) networks for processing sequential data. Their simpler architecture, featuring only two gates and a combined hidden state, results in faster training times without significantly compromising performance. GRUs excel in tasks where quick training and effective handling of temporal dependencies are crucial, such as speech recognition, time series forecasting, and healthcare applications. Although they may not capture very long-term dependencies as effectively as LSTMs, GRUs balance simplicity and power, making them a versatile tool in the machine learning toolkit. This balance allows GRUs to address the limitations of traditional RNNs while offering a practical solution for many sequential data challenges.

Papers which provide deeper insights into GRUs:

  1. Gate-Variants of Gated Recurrent Unit (GRU) Neural Networks: https://arxiv.org/pdf/1701.05923
  2. Deep Learning with Gated Recurrent Unit Networks for Financial Sequence Predictions: https://www.sciencedirect.com/science/article/pii/S1877050918306781
  3. Comparative analysis of Gated Recurrent Units (GRU), long Short-Term memory (LSTM) cells, autoregressive Integrated moving average (ARIMA), seasonal autoregressive Integrated moving average (SARIMA) for forecasting COVID-19 trends: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9453185/

References:


[i] Analytics Vidhya

[ii] GeeksforGeeks

[iii] Medium – anishnama20

[iv] Towards Data Science

[v] Analytics Vidhya

[vi] Medium – harshedabdulla

[vii] Analytics Vidhya

[viii] Analytics Vidhya

[ix] Medium – anishnama20