What Is a Transformer?

If you have been following any tech news in the last two years, you must have heard about OpenAI and its AI assistant ChatGPT, the immense efforts across big tech to catch up to OpenAI with AI assistants of their own, and the ones profiting from it all (Nvidia shareholders). In this series of blog posts, we want to give you a high-level understanding of the technology behind these models and products, so that you know what you are talking about when discussing these potentially world-changing developments with your colleagues in the pantry, with your family at the dinner table, or when pitching a new AI product to your boss.

Large Language Models (LLMs) are a major breakthrough in natural language processing, a subfield of machine learning. The most successful model to date is OpenAI’s generative pre-trained transformer (GPT), popularized by their chatbot application ChatGPT. Other language models based on the transformer architecture include Google’s BERT, PaLM and the latest Gemini 1.5 Pro, Meta’s LLaMA 2 70B, and MistralAI’s Mistral 7B, where the letter ‘B’ stands for billion and refers to the number of parameters in the model.

So, what is a transformer? Transformers are a specific type of neural network architecture, first introduced in the 2017 paper “Attention Is All You Need” by researchers at Google (see the references below this article for a link to the paper). In essence, these models try to predict the next word in a sequence: a very sophisticated version of autocomplete, if you will. The researchers introduced a so-called self-attention mechanism (hence the name of the paper) that allows the network to understand the importance of different words within a sequence and how strongly they relate to one another.

Let’s take, for example, the sentence “Time flies when you’re having fun”. We would expect the words Time and flies to pay significant attention to each other, as the verb flies tells you what happens to the noun Time. Similarly, we would expect the word fun to be connected to flies in order to capture the meaning of the sentence, linking the passage of time and enjoyment. Since attention is calculated for each word in a sentence relative to every other word in that sentence, this creates a context-aware representation of each word.

To fully appreciate the novelty of this approach, we can look at the previous machine learning approaches to modelling natural language, notably the recurrent neural network (RNN) and its evolution, long short-term memory (LSTM).

RNNs were developed to improve on standard feedforward neural networks for processing sequential data. The RNN’s innovation is to introduce a sequential representation of the data using a hidden state that stores information from previous inputs. This creates a dependence structure across time, as every output is now connected to the previous inputs through the hidden state. Figure 1 shows the RNN network architecture.

Figure 1. Source: https://medium.com/@poudelsushmita878/recurrent-neural-network-rnn-architecture-explained-1d69560541ef
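
To make this concrete, here is a minimal sketch of a single RNN step in NumPy. The weight names and dimensions are purely illustrative and not taken from any particular implementation.

```python
import numpy as np

# Minimal sketch of one RNN time step (illustrative weight names and sizes).
# h_prev is the hidden state carrying information from all previous inputs.
def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    # The new hidden state mixes the current input with the previous hidden state,
    # which is how the dependence structure across time is created.
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

# Toy usage: a sequence of 5 inputs of dimension 3, with hidden size 4.
rng = np.random.default_rng(0)
W_xh, W_hh, b_h = rng.normal(size=(3, 4)), rng.normal(size=(4, 4)), np.zeros(4)
h = np.zeros(4)
for x_t in rng.normal(size=(5, 3)):
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)  # each output depends on everything seen so far
```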

The problem with RNNs is that this dependence structure can fade quickly across time; this is known as the vanishing gradient problem, and it can occur when training the model via backpropagation. The gradients simply become too small to capture long-term dependencies, i.e. by the end of a sentence the model might have forgotten what happened at the beginning of it. The opposite problem, exploding gradients, can also occur, making it difficult for the model to converge to an optimal solution.
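
A back-of-the-envelope illustration of why this happens: backpropagating through many time steps repeatedly multiplies the gradient by a factor related to the recurrent weights. The factors below are made up purely to show the effect.

```python
# Made-up factors, purely to illustrate repeated multiplication when
# backpropagating through 50 time steps.
for factor, label in [(0.5, "vanishing"), (1.5, "exploding")]:
    grad = 1.0
    for _ in range(50):
        grad *= factor
    print(f"{label}: gradient contribution after 50 steps ≈ {grad:.2e}")
# vanishing: ≈ 8.88e-16 -> the start of the sentence is effectively forgotten
# exploding: ≈ 6.38e+08 -> training becomes unstable
```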

Long short-term memory (LSTM) models attempted to fix this issue by introducing gates that manage the flow of information from one timestep to the next. The LSTM has a cell state that acts as the memory of the model at each timestep, holding information from previous timesteps. The forget gate controls how much information from the previous cell state is kept, while the input gate and the candidate state together decide what new information from the current input is added to the new cell state.

Figure 2. Source: https://d2l.ai/chapter_recurrent-modern/lstm.html
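
Below is a minimal sketch of a single LSTM step in NumPy, including the output gate that decides how much of the cell state is exposed as the hidden state. Again, the weight shapes and names are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Minimal sketch of one LSTM time step (illustrative shape: W is ((d_x + d_h), 4 * d_h)).
def lstm_step(x_t, h_prev, c_prev, W, b):
    z = np.concatenate([x_t, h_prev]) @ W + b      # one linear layer, split four ways
    f, i, g, o = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # forget, input and output gates
    g = np.tanh(g)                                 # candidate state
    c_t = f * c_prev + i * g                       # keep part of the old memory, add new information
    h_t = o * np.tanh(c_t)                         # expose part of the memory as the new hidden state
    return h_t, c_t
```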

As you might have noticed, the information in both RNNs and LSTMs flows from past time steps to future time steps. However, this poses problems for natural language models, especially in the domain of translation. Translating sentences word by word does not work very well: different languages have different word-order rules, and some words have different meanings depending on the context of the rest of the sentence or the entire document. An example is the word bank. Does it refer to the edge of a river or to the financial institution? If the model has access to the surrounding context of the text, it will be better able to understand which meaning of bank is intended.

Now, coming back to transformers: as mentioned previously, the main innovation is the self-attention mechanism. Instead of a time-series-like representation in which information flows from “past” to “present”, as in RNNs and LSTMs, the self-attention mechanism in transformers connects each word in the input sequence to every other word in the input sequence using a scaled dot-product mechanism. Additionally, the authors also implemented something called cross-attention. This works like self-attention, but the dot-product mechanism connects one sequence with another, for example the sentence being generated by the decoder with the encoder’s representation of the input sentence.

Figure 3. Source: https://magazine.sebastianraschka.com/p/understanding-and-coding-self-attention
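
To give a feel for what the scaled dot-product mechanism looks like in code, here is a minimal NumPy sketch of self-attention for a single sequence. The projection matrices are randomly initialised purely for illustration; in a real transformer they are learned, and this sketch omits details such as multiple heads and masking.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)

# Minimal sketch of scaled dot-product self-attention (single head, no masking).
# X holds one embedding vector per word; W_q, W_k, W_v are the query/key/value projections.
def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # how much each word attends to every other word
    weights = softmax(scores)                 # each row is an attention distribution summing to 1
    return weights @ V                        # context-aware representation of each word

# Toy example: 6 "words" with 8-dimensional embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)        # shape (6, 8): one new vector per word
```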

This attention mechanism has two advantages. First, it gives context information to each word, making it easier for the model to understand the meaning of a word in its current context, and it avoids the vanishing gradient problem that plagues RNNs. Second, it allows transformers to be trained with parallel processing: the inputs do not have to be processed one by one but can be processed simultaneously, which significantly speeds up computation and allows big tech companies to train the models on gigantic datasets and scale them up efficiently.
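
The contrast between sequential and parallel processing can be sketched as follows; this is only an illustration of the data flow, not a benchmark of real implementations, and the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 512, 64                                # sequence length and embedding size (arbitrary)
X = rng.normal(size=(T, d))
W = rng.normal(size=(d, d))

# RNN-style: each step needs the result of the previous one, so the loop cannot be parallelised.
h = np.zeros(d)
for t in range(T):
    h = np.tanh(X[t] @ W + h)

# Attention-style: all positions are handled in a few big matrix multiplications,
# which hardware such as GPUs can execute in parallel.
scores = (X @ W) @ X.T / np.sqrt(d)
```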

I hope this blog gave you a high-level overview of the 2017 innovation in the self-attention paper that is fuelling today’s tech industry. Below, I share some links to further material that goes into more detail if you are interested. In the next post of this series, I will talk about how transformers, which are essentially document-completion models, get turned into helpful AI assistants.

Further reading:

  1. The self-attention paper itself: https://arxiv.org/pdf/1706.03762.pdf
  2. Understanding and Coding the Self-Attention Mechanism of LLMs From Scratch: https://sebastianraschka.com/blog/2023/self-attention-from-scratch.html
  3. Building GPT from scratch by Andrej Karpathy (founding member of OpenAI): https://www.youtube.com/watch?v=kCc8FmEb1nY
  4. Transformer Neural Networks clearly explained by Statquest: https://www.youtube.com/watch?v=zxQyTK8quyY
  5. How to craft a good prompt: https://www.queryvary.com/post/how-to-write-a-good-system-prompt-instructions-to-gpt
  6. Build your own LLM-powered automation: https://queryvary.com 

Author: Stefan Altmann
