In this post, I attempt to provide a concise history of Recurrent Neural Networks (RNNs) and their advanced versions: why did they come into existence? Why did they become senile over time? And what are the current generations of RNNs?
RNNs are a class of neural networks designed to capture temporal (time-related) dependencies, which was earlier impossible to achieve with feed-forward neural networks. As the name recurrent implies, the network learns a set of weights that are shared across time. In layman's terms: as time advances, what was learned from the past is not lost but carried over, so the network remembers what it has seen before. The information that is carried over is called the history (or hidden state). So what?… Well, this lets the model base its predictions not only on the current input but also on everything that came before.
Statistically, RNN cells emulate latent autoregressive models: the prediction at timestep 't' is conditioned on the input observed at time 't' and on the history (a learned function of all past inputs) up to time 't'. However, RNNs suffer from the vanishing-gradient problem, which renders them senile for interesting applications involving long-term dependencies in the input. For the sake of simplicity, one may think of a vanishing gradient as attenuation of the signal (data) as it travels back through many timesteps.
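To make the recurrence concrete, here is a minimal NumPy sketch of a single vanilla RNN step. The weight names (W_xh, W_hh, b_h) and the toy dimensions are my own illustration, not from any particular library:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    # The same weights are reused at every timestep; h carries the history forward.
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

# Toy dimensions: 4-dimensional inputs, 8-dimensional hidden state.
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(4, 8))
W_hh = rng.normal(scale=0.1, size=(8, 8))
b_h = np.zeros(8)

h = np.zeros(8)                            # history starts empty
for x_t in rng.normal(size=(5, 4)):        # a sequence of 5 inputs
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)  # h_t depends on x_t and all earlier inputs
```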
These shortcomings of RNNs are addressed by Long Short-Term Memory (LSTM) units and Gated Recurrent Units (GRUs), which filter the history of an RNN cell, retaining only the part that is relevant to the scenario and forgetting the rest.
These GRUs & LSTMs are selective in deciding which pieces of history to keep and which to discard at each timestep.
GRU (Gated Recurrent Unit)
GRUs have gating mechanisms that decide when the hidden state (history) should be updated and when it should be reset. These mechanisms are learned during training.
In programming terms, a reset variable decides how much of the previous state will be retained, and an update variable decides how much of the state modified by the reset variable will be used. Both variables are vectors whose elements lie between 0 and 1.
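In code, both variables are simply sigmoid outputs of learned linear transforms of the current input and the previous hidden state; the sigmoid is what squashes every element into (0, 1). A sketch with illustrative weight names of my own choosing:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_gates(x_t, h_prev, W_xr, W_hr, b_r, W_xz, W_hz, b_z):
    # Reset (r) and update (z) vectors; every element lies in (0, 1).
    r = sigmoid(x_t @ W_xr + h_prev @ W_hr + b_r)  # how much of the previous state to expose
    z = sigmoid(x_t @ W_xz + h_prev @ W_hz + b_z)  # how much of the old state to carry over
    return r, z

# Toy usage with random weights (4-dim input, 8-dim hidden state).
rng = np.random.default_rng(1)
x_t, h_prev = rng.normal(size=4), rng.normal(size=8)
W_xr, W_xz = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
W_hr, W_hz = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
r, z = gru_gates(x_t, h_prev, W_xr, W_hr, np.zeros(8), W_xz, W_hz, np.zeros(8))
```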
Reset gate
To reduce the influence of previous states, we multiply the hidden state (a vector) elementwise with the reset vector. Elements of the hidden state corresponding to reset values close to 1 are mostly retained, while those corresponding to values close to 0 are suppressed, yielding a new, modified hidden state.
There are 2 special cases, depending on the elements of the reset vector (see the toy example after this list):
- If all elements = 1, the GRU behaves like a vanilla RNN: everything in the hidden state is remembered.
- If all elements = 0, the GRU behaves like a feed-forward neural network: nothing in the hidden state is remembered.
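The elementwise product makes both special cases easy to see. A toy example (the numbers are made up purely for illustration):

```python
import numpy as np

h_prev = np.array([0.9, -0.4, 0.7])  # previous hidden state (history)

# Reset vector of all 1s: the full history flows through, as in a vanilla RNN.
r_keep = np.array([1.0, 1.0, 1.0])
print(r_keep * h_prev)  # [ 0.9 -0.4  0.7]

# Reset vector of all 0s: the history is wiped out, as in a feed-forward network.
r_drop = np.array([0.0, 0.0, 0.0])
print(r_drop * h_prev)  # [0. 0. 0.]
```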
Update gate
This gate weighs the new modified hidden state from the reset gate against the old state. If the values of the update vector are close to 1, the old state is retained, effectively skipping the input at timestep 't'; if they are close to 0, the modified state from the reset gate is used instead.
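As a sketch of this interpolation (following the convention where the update vector weights the old state; the values are toy numbers for illustration):

```python
import numpy as np

def gru_update(z, h_prev, h_candidate):
    # Convex combination: z near 1 keeps the old state (skipping the current
    # input), z near 0 adopts the reset-modified candidate state.
    return z * h_prev + (1.0 - z) * h_candidate

h_prev = np.array([0.9, -0.4, 0.7])   # old state
h_cand = np.array([0.1, 0.3, -0.2])   # candidate state from the reset gate

print(gru_update(np.array([1.0, 1.0, 1.0]), h_prev, h_cand))  # old state survives
print(gru_update(np.array([0.0, 0.0, 0.0]), h_prev, h_cand))  # candidate replaces it
```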
To summarize,
- The reset gate captures short-term dependencies.
- The update gate captures long-term dependencies.