Lecture 8: Recurrent Neural Networks


Recurrent Neural Networks

COMP 4630 | Winter 2026 Charlotte Curtis


Overview


Sequence data

  • So far we’ve been talking about images, tabular data, and other “static” data
  • ❓ What are some examples of sequence data?



Non-RNN Approaches


As usual, you don’t always need a deep learning solution 🔨

  • ❓ What is an example of a “naive” approach?
  • ❓ What are some limitations of naive approaches?
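
One possible naive baseline, sketched under the assumption that the series is a plain list of numbers: predict that the next value equals the last observation, or the observation one full season earlier. The function names, the `season` parameter, and the data are illustrative and not from the lecture.

```python
def naive_forecast(series):
    """Predict the next value as the most recent observation."""
    return series[-1]


def seasonal_naive_forecast(series, season):
    """Predict the next value as the observation one full season earlier."""
    return series[-season]


# Hypothetical weekly pattern repeated twice (made-up numbers).
history = [10, 12, 15, 14, 13, 20, 22, 10, 12, 15, 14, 13, 20, 22]
last_value_guess = naive_forecast(history)
seasonal_guess = seasonal_naive_forecast(history, season=7)
```

Baselines like these are cheap to compute and give a reference point that a more complex model should beat.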

Autoregressive Moving Average

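  • Models to predict time series with a weighted average of past values and past forecast errors: $\hat{y}_t = \sum_{i=1}^{p} \alpha_i y_{t-i} + \sum_{i=1}^{q} \theta_i \epsilon_{t-i}$, where $\epsilon_t = y_t - \hat{y}_t$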
  • Key assumption: data is stationary (mean and variance don’t change)
  • ARIMA adds on “integration” or “differencing” to account for trends
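
A minimal sketch of a single ARMA forecast, assuming the standard form in which the prediction is a weighted sum of the last few observations plus a weighted sum of the last few forecast errors. The coefficient names `alpha` and `theta` and all numbers are made up for illustration; in practice the weights are fitted to data.

```python
import numpy as np


def arma_one_step(past_values, past_errors, alpha, theta):
    """One-step ARMA forecast.

    past_values: most recent observation first (y[t-1], y[t-2], ...)
    past_errors: most recent forecast error first (e[t-1], e[t-2], ...)
    alpha, theta: autoregressive and moving-average weights
    """
    p, q = len(alpha), len(theta)
    ar_term = np.dot(alpha, past_values[:p])
    ma_term = np.dot(theta, past_errors[:q])
    return ar_term + ma_term


# Made-up coefficients and data, for illustration only.
alpha = np.array([0.5, 0.25])        # two autoregressive weights
theta = np.array([0.5])              # one moving-average weight
past_values = np.array([4.0, 2.0])   # y[t-1], y[t-2]
past_errors = np.array([1.0])        # e[t-1]
forecast = arma_one_step(past_values, past_errors, alpha, theta)
```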

ARIMA

  • Autoregressive parameter $p$: How many steps back to average?
  • Moving average parameter $q$: How many previous errors to average?
  • Integrative parameter $d$: How many “differencing” rounds to perform before applying ARMA?

    $d$ can be thought of as approximating the trend with a polynomial of order $d$
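
A small sketch of why differencing handles trends, assuming a made-up series that is a pure quadratic: one round of differencing leaves a linear trend, and a second round leaves a constant. `np.diff` with `n=2` applies the differencing twice.

```python
import numpy as np

t = np.arange(8)
quadratic_trend = 3 * t**2 + 2 * t + 1       # made-up order-2 polynomial trend

first_diff = np.diff(quadratic_trend, n=1)   # still has a linear trend
second_diff = np.diff(quadratic_trend, n=2)  # constant: trend removed
```

Each round of differencing also shortens the series by one value, so the twice-differenced series here has six entries.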


  • ❓ Are there any obvious trends in the data?
  • ❓ What about non-obvious trends?
  • ❓ How might this dataset be treated differently from the previous one?


Feedforward vs recurrent networks

  • Feedforward: data flows in one direction (then backpropagated)
  • Recurrent: data can flow in loops
Figure from Scikit-learn textbook

Recurrent layers

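  • The simplest recurrent layer has a single feedback connection: $\mathbf{h}_t = \phi(\mathbf{W}_x \mathbf{x}_t + \mathbf{W}_h \mathbf{h}_{t-1})$, where $\phi$ is the activation function and $\mathbf{W}_x$ and $\mathbf{W}_h$ are weight matrices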
  • “Backpropagation through time” (BPTT) is exactly the same as regular backpropagation through the unrolled network
  • ❓ What kind of issues might arise during training?
  • ❓ What are some limitations of this approach?
  • ❓ How can we deal with the previous state $\mathbf{h}_{t-1}$ at the first time step, $t = 0$?
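
A minimal NumPy sketch of the forward pass of a single recurrent layer, unrolled over time. The names `W_x`, `W_h`, `b`, and `h0` are assumptions for illustration, as are the bias term and the choice of an all-zero initial state (a common default, not the only option).

```python
import numpy as np


def simple_rnn_forward(x_seq, W_x, W_h, b, h0=None):
    """Run a single recurrent layer over a sequence.

    x_seq: array of shape (time_steps, n_inputs)
    W_x:   array of shape (n_units, n_inputs)
    W_h:   array of shape (n_units, n_units)
    b:     array of shape (n_units,)
    Returns the hidden states, shape (time_steps, n_units).
    """
    n_units = W_h.shape[0]
    # Assumption: the state before the first step is all zeros.
    h = np.zeros(n_units) if h0 is None else h0
    states = []
    for x_t in x_seq:
        # The same weights are reused at every time step.
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)


rng = np.random.default_rng(0)
x_seq = rng.normal(size=(5, 3))        # 5 time steps, 3 features
W_x = rng.normal(size=(4, 3)) * 0.1
W_h = rng.normal(size=(4, 4)) * 0.1
b = np.zeros(4)
states = simple_rnn_forward(x_seq, W_x, W_h, b)
```

The loop body is the "unrolled" network: backpropagation through time differentiates through every iteration of this loop.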

Preparing data for RNNs

  • The data format depends on the task, e.g. do you want to predict:
    • The next value in a sequence (e.g. predictive text)
    • The next values in a sequence (e.g. stock prices)
    • The next sequence in a set of sequences (e.g. language translation)
  • Let’s start with predicting the next value in a sequence
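
A sketch of one way to prepare data for next-value prediction, assuming a 1-D series and a fixed window length: each input is a window of consecutive values, and the target is the value that follows it. The helper name `make_windows` is hypothetical.

```python
import numpy as np


def make_windows(series, window):
    """Split a 1-D series into (inputs, targets) for next-value prediction."""
    series = np.asarray(series)
    n_samples = len(series) - window
    inputs = np.stack([series[i:i + window] for i in range(n_samples)])
    targets = series[window:]
    return inputs, targets


X, y = make_windows([0, 1, 2, 3, 4, 5], window=3)
# X is [[0, 1, 2], [1, 2, 3], [2, 3, 4]] and y is [3, 4, 5]
```

Recurrent layers in most frameworks expect an extra trailing feature dimension, so `X` would typically be reshaped to `(samples, time_steps, 1)` before training.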



Activation Functions for RNNs

  • The default activation function for recurrent layers in TensorFlow (Keras) and PyTorch is tanh
  • ❓ What is different about RNNs that might influence the choice of activation function?
  • ❓ How might we normalize sequence data?
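
One way to explore the activation question is a toy scalar recurrence, sketched here with made-up values for the weight `w` and input `x`. Because the same weight is applied at every step, an unbounded activation such as ReLU can grow without limit, while tanh keeps the state bounded.

```python
import numpy as np


def relu(z):
    return max(z, 0.0)


def run_recurrence(activation, w=1.5, x=1.0, steps=50):
    """Apply h = activation(w * h + x) repeatedly, starting from h = 0."""
    h = 0.0
    for _ in range(steps):
        h = activation(w * h + x)
    return h


h_relu = run_recurrence(relu)     # grows roughly like 1.5 ** steps
h_tanh = run_recurrence(np.tanh)  # stays between -1 and 1
```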

Beyond the “next value”

  • Option 1: Use the single-prediction RNN repeatedly
  • Option 2: Train the RNN to predict multiple values at once
    • Easy change model-wise, but data preparation is trickier
    • n inputs, n outputs
  • Option 3: Use a “sequence to sequence” model
    • Even trickier data preparation, but n values are predicted at each time step instead of just at the end
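
Option 1 can be sketched as a loop that feeds each prediction back in as input. `toy_predictor` is a hypothetical stand-in for a trained single-step model; it simply continues the most recent step so the example runs on its own.

```python
def forecast_iteratively(predict_next, history, window, n_steps):
    """Forecast n_steps ahead by feeding each prediction back as input."""
    values = list(history)
    for _ in range(n_steps):
        next_value = predict_next(values[-window:])
        values.append(next_value)
    return values[len(history):]


def toy_predictor(recent):
    """Hypothetical stand-in for a trained model: continue the last step."""
    return recent[-1] + (recent[-1] - recent[-2])


forecasts = forecast_iteratively(toy_predictor, [0, 1, 2], window=3, n_steps=3)
# forecasts is [3, 4, 5]
```

With a real model, any error in an early prediction becomes part of the input for later ones, so errors tend to accumulate as the horizon grows.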

Seq2seq input/target examples

n   Input       Target
1   [0, 1, 2]   [1, 2, 3]
2   [0, 1, 2]   [[1, 2], [2, 3], [3, 4]]
3   [0, 1, 2]   [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
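
The targets above can be generated with a short helper, sketched here assuming the underlying series is 0, 1, 2, ... and that the target at each input time step is the next n values. The name `seq2seq_targets` and its parameters are hypothetical.

```python
import numpy as np


def seq2seq_targets(series, start, window, n):
    """Targets for the input series[start:start + window].

    At each input time step, the target is the next n values.
    """
    series = np.asarray(series)
    return np.stack([
        series[start + t + 1:start + t + 1 + n] for t in range(window)
    ])


series = np.arange(6)  # 0, 1, 2, 3, 4, 5
targets_n2 = seq2seq_targets(series, start=0, window=3, n=2)
targets_n3 = seq2seq_targets(series, start=0, window=3, n=3)
```

For n = 1 this helper returns `[[1], [2], [3]]`, which is the first row of the table with an extra dimension.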

Problems with long sequences

  • Gradient vanishing/exploding
    • Choose activation functions and initialization carefully
    • Consider “Layer normalization” (across features)
  • “Forgetting” early data
    • Skip connections through time
    • “Leaky” RNNs
    • Long short-term memory (LSTM)
  • Computational efficiency and memory constraints
    • Gated recurrent units (GRUs)
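
A toy illustration of vanishing and exploding gradients, assuming the simplest possible case: a scalar linear recurrence with no activation function. The chain rule contributes one factor of the recurrent weight per time step, so the gradient shrinks or grows exponentially with sequence length.

```python
def gradient_through_time(w, steps):
    """d h_T / d h_0 for the linear recurrence h_t = w * h_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        grad *= w  # chain rule: one factor of w per time step
    return grad


vanishing = gradient_through_time(0.9, steps=100)  # about 2.7e-5
exploding = gradient_through_time(1.1, steps=100)  # about 1.4e4
```

In a real RNN the factors are Jacobian matrices that also include the activation derivative, but the exponential behaviour is the same in spirit.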

Skip connections and leaky RNNs

  • Simple way of preserving earlier data:
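  • Vanilla RNN: $\mathbf{h}_t$ depends on $\mathbf{h}_{t-1}$ only
  • Skip connection: $\mathbf{h}_t$ depends on $\mathbf{h}_{t-1}$, $\mathbf{h}_{t-2}$, $\mathbf{h}_{t-3}$, etc.
  • Leaky RNN has a smooth “self-connection” to dampen the exponential, for example $\mathbf{h}_t = \alpha \mathbf{h}_{t-1} + (1 - \alpha)\,\tilde{\mathbf{h}}_t$, where $\tilde{\mathbf{h}}_t$ is the usual vanilla RNN update and $0 \le \alpha \le 1$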
  • Not common approaches anymore, as LSTM, GRU, and especially attention mechanisms are more popular
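
A sketch of one leaky update, assuming a common formulation in which the new state is a weighted blend of the previous state and the usual vanilla RNN candidate. The blending factor `alpha` and all weight names are assumptions for illustration.

```python
import numpy as np


def leaky_rnn_step(h_prev, x_t, W_x, W_h, b, alpha):
    """One leaky update: blend the old state with the usual RNN candidate."""
    candidate = np.tanh(W_x @ x_t + W_h @ h_prev + b)
    return alpha * h_prev + (1.0 - alpha) * candidate


rng = np.random.default_rng(0)
W_x = rng.normal(size=(4, 3)) * 0.1
W_h = rng.normal(size=(4, 4)) * 0.1
b = np.zeros(4)
h_prev = rng.normal(size=4)
x_t = rng.normal(size=3)

h_keep = leaky_rnn_step(h_prev, x_t, W_x, W_h, b, alpha=1.0)   # state unchanged
h_plain = leaky_rnn_step(h_prev, x_t, W_x, W_h, b, alpha=0.0)  # vanilla RNN step
```

Values of `alpha` close to 1 let information persist over many steps, while values close to 0 recover the vanilla behaviour.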

Long Short-Term Memory (LSTM)


Figure from Scikit-learn textbook
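
A minimal NumPy sketch of one LSTM time step, assuming the standard formulation with forget, input, and output gates plus a candidate vector. The stacked weight layout (`W`, `U`, `b`) and the gate ordering are choices made for this example; frameworks order the gates differently.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W: (4 * n_units, n_inputs), U: (4 * n_units, n_units), b: (4 * n_units,)
    Rows are stacked in the order: forget, input, output, candidate.
    """
    z = W @ x_t + U @ h_prev + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)  # gates in (0, 1)
    g = np.tanh(g)                                # candidate values
    c_t = f * c_prev + i * g                      # long-term (cell) state
    h_t = o * np.tanh(c_t)                        # short-term state / output
    return h_t, c_t


rng = np.random.default_rng(0)
n_inputs, n_units = 3, 4
W = rng.normal(size=(4 * n_units, n_inputs)) * 0.1
U = rng.normal(size=(4 * n_units, n_units)) * 0.1
b = np.zeros(4 * n_units)
x_t = rng.normal(size=n_inputs)
h_t, c_t = lstm_step(x_t, np.zeros(n_units), np.zeros(n_units), W, U, b)
```

The cell state is updated additively (`f * c_prev + i * g`), which is what lets information and gradients survive over longer spans than in a vanilla RNN.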

Gated Recurrent Units (GRUs)


Figure from Scikit-learn textbook
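
A matching sketch of one GRU time step, assuming the convention in which the update gate `z` weights the previous state and `1 - z` weights the candidate (some references swap these roles). As with the LSTM sketch, the stacked weight layout is an assumption for this example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gru_step(x_t, h_prev, W, U, b):
    """One GRU time step.

    W: (3 * n_units, n_inputs), U: (3 * n_units, n_units), b: (3 * n_units,)
    Rows are stacked in the order: update, reset, candidate.
    """
    W_z, W_r, W_g = np.split(W, 3)
    U_z, U_r, U_g = np.split(U, 3)
    b_z, b_r, b_g = np.split(b, 3)
    z = sigmoid(W_z @ x_t + U_z @ h_prev + b_z)        # update gate
    r = sigmoid(W_r @ x_t + U_r @ h_prev + b_r)        # reset gate
    g = np.tanh(W_g @ x_t + U_g @ (r * h_prev) + b_g)  # candidate state
    return z * h_prev + (1.0 - z) * g


rng = np.random.default_rng(0)
n_inputs, n_units = 3, 4
W = rng.normal(size=(3 * n_units, n_inputs)) * 0.1
U = rng.normal(size=(3 * n_units, n_units)) * 0.1
b = np.zeros(3 * n_units)
h_prev = rng.uniform(-0.9, 0.9, size=n_units)
h_t = gru_step(rng.normal(size=n_inputs), h_prev, W, U, b)
```

Compared with the LSTM sketch, there is a single state vector and three weight blocks instead of four, which is where the savings in computation and memory come from.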

Next up: Natural Language Processing


Preview: Natural Language Processing

  • Natural Language Processing (NLP) is a field of study that focuses on the interaction between computers and human language.
  • RNNs are widely used in NLP tasks such as language modeling, machine translation, sentiment analysis, and text generation.
  • Language modeling involves predicting the next word in a sequence of words, which can be done using RNNs.
  • Machine translation uses RNNs to translate text from one language to another.
  • Sentiment analysis aims to determine the sentiment or emotion expressed in a piece of text, and RNNs can be used for this task.
  • Text generation involves generating new text based on a given input, and RNNs are commonly used for this purpose.
This slide written by GitHub Copilot in Winter 2024

Preview: Natural Language Processing

  • What is Natural Language Processing (NLP)?
  • Common NLP tasks:
    • Language modeling
    • Machine translation
    • Sentiment analysis
    • Text generation
  • How RNNs are applied in NLP
This slide written by GitHub Copilot in Winter 2025

Preview: Natural Language Processing

  • NLP in 2026 is dominated by large language models (LLMs) like GPT-4o, Claude, and Gemini
  • Transformer-based architectures have largely replaced RNNs for most NLP tasks
  • Key capabilities of modern NLP systems:
    • Multi-modal understanding (text, images, audio, video)
    • Long-context reasoning (millions of tokens)
    • Agentic behaviour: tool use, planning, and self-correction
  • ❓ If transformers have replaced RNNs, why are we still studying them?
This slide written by GitHub Copilot in Winter 2026