The Rise of the Transformers – A Simplified Version, Part 2
This post continues from my previous article, which is worth reading first.
In this article, I’ll begin unpacking some of the key steps that need to be completed before we get to the attention mechanism. Let’s first understand how a large language model (LLM) interacts with a human user.
Humans communicate in natural language—sentences, paragraphs, and conversation. But LLMs, like ChatGPT, operate purely with numbers. So, the first step in any interaction is to convert the text we type into a format the model can understand.
This transformation happens in a few key stages:
1. Text Input (from user)
2. Tokenization – breaking text into smaller units the model can process
3. Numerical Conversion – each token is mapped to a unique ID, and each ID is then mapped to a high-dimensional vector (a list of numbers) that captures the token’s meaning in a format the model can work with
4. Positional Encoding – adding information about the order of the tokens, since models don't inherently understand sequence
Let’s expand on this.
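Before we dig into each stage, here’s a rough sketch of stages 1 through 3 in Python. Everything in it is illustrative: the function names, the toy three-word vocabulary, and the two-number “vectors” are my own simplifications for this post, not any real model’s internals.

```python
# A toy walk-through of stages 1-3. Real models use subword tokenizers,
# vocabularies of tens of thousands of entries, and vectors with hundreds
# of dimensions; this just shows the shape of the pipeline.

def tokenize(text):
    # Stage 2: break the text into tokens (toy rule: split on spaces).
    return text.split(" ")

def to_ids(tokens, vocab):
    # Stage 3a: map each token to its unique ID in the vocabulary.
    return [vocab[token] for token in tokens]

def to_vectors(ids, embeddings):
    # Stage 3b: map each ID to a vector that encodes the token's meaning.
    return [embeddings[i] for i in ids]

vocab = {"Transformers": 0, "are": 1, "great.": 2}  # toy vocabulary
embeddings = [[0.1, 0.3], [0.7, 0.2], [0.5, 0.9]]   # toy 2-dim vectors

tokens = tokenize("Transformers are great.")  # stage 1 input, stage 2
ids = to_ids(tokens, vocab)                   # stage 3a -> [0, 1, 2]
vectors = to_vectors(ids, embeddings)         # stage 3b
print(tokens, ids, vectors, sep="\n")
# Stage 4, positional encoding, is covered later in this series.
```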
When you type a sentence into a system like ChatGPT, the first step is breaking it down into smaller pieces called tokens. These tokens may or may not be full words—they can be parts of words, whole words, or even spaces and punctuation.
Let’s look at an example:
Input sentence:
Transformers are great.
ChatGPT doesn’t see this as one sentence. Instead, it breaks it into the following tokens:
• "Transform"
• "ers"
• " are" (note the leading space)
• " great" (again, the space is included)
• "."
These tokens are then converted into token IDs and passed through the next steps in the model’s processing pipeline.
Below are the token IDs ChatGPT assigned to these tokens:
[12200, 409, 553, 2212]
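If you’d like to see this for yourself, you don’t need ChatGPT: OpenAI’s open-source tiktoken library exposes the same kind of tokenizer. Here’s a minimal sketch, assuming tiktoken is installed; the exact tokens and IDs you get depend on which encoding you load.

```python
# pip install tiktoken
import tiktoken

# cl100k_base is one of OpenAI's published encodings; other encodings
# will split the same text differently and use different IDs.
enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("Transformers are great.")
print(ids)  # the token IDs for this sentence under this encoding

# Decode each ID individually to see where the token boundaries fall.
for token_id in ids:
    print(token_id, repr(enc.decode([token_id])))
```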
One might reasonably ask: why do LLMs split sentences into tokens instead of handling full words or whole sentences?
Words are broken down into tokens to match a predefined vocabulary list that the LLM was trained on.
Think of the predefined vocabulary list like a custom dictionary that the AI model uses to read and understand text. This dictionary doesn’t just include full words like “apple” or “fast”; it also includes pieces of words, prefixes, suffixes, and even spaces or punctuation. Each item in the dictionary is called a token, and each token has a unique ID number.
So where did this custom dictionary or vocabulary come from?
During training, the model looked at a massive amount of text: books, articles, websites, and conversations. It learned to spot patterns in how words and pieces of words appear together. The most common and useful combinations were added to its custom dictionary, or what in AI speak is called the model’s vocabulary.
So when you enter text like "Transformers are great.", the tokenizer tries to match pieces of the sentence to known tokens in its vocabulary. If a full word like "Transformers" is not in the vocabulary as a single token, it will break it down into smaller known parts like "Transform" and "ers".
This is why:
• Common words may be a single token (e.g., "the" or "are"),
• Rare or complex words are split into multiple tokens (e.g., "Transformers" → "Transform" + "ers"),
• Even spaces before words are part of tokens (e.g., " great" includes the space).
This system helps keep the vocabulary manageable while allowing the model to understand any input, even if it’s a new word or name it has never seen before.
This process of breaking a sentence into individual tokens is called tokenization.
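To make tokenization concrete, here’s a toy sketch of one simple matching strategy: greedily take the longest vocabulary entry at each position. The vocabulary and IDs below are made up for illustration, and real tokenizers (such as byte-pair encoding) build their splits differently, but the “match pieces against a known vocabulary” idea is the same.

```python
# Toy vocabulary with made-up IDs; real vocabularies hold ~50,000+ entries.
VOCAB = {
    "Transform": 1, "ers": 2, " are": 3, " great": 4, ".": 5,
}

def tokenize(text):
    """Greedily match the longest vocabulary entry at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, shrinking until a match.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Unknown character: real tokenizers fall back to bytes here.
            tokens.append(text[i])
            i += 1
    return tokens

tokens = tokenize("Transformers are great.")
print(tokens)                          # ['Transform', 'ers', ' are', ' great', '.']
print([VOCAB.get(t) for t in tokens])  # [1, 2, 3, 4, 5]
```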
To recap: before a model can understand anything we type, it first breaks the sentence into tokens, converts each token into a number, then finds a meaningful vector that represents the token. All of this happens before the model even starts “paying attention.”
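For the curious, the “finds a meaningful vector” step is usually just a table lookup: the model stores one learned vector per vocabulary entry in an embedding matrix, and the token ID picks out a row. Here’s a minimal sketch, with made-up sizes and random numbers standing in for the values a real model learns during training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes; real models vary. In practice these values are
# learned during training, not random.
vocab_size, embed_dim = 50_000, 768
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))

token_ids = [12200, 409, 553, 2212]    # the IDs shown earlier in this post
vectors = embedding_matrix[token_ids]  # one row (vector) per token ID
print(vectors.shape)                   # (4, 768)
```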
In the next part of this series, I’ll explore how numerical conversion and positional encoding prepare tokens for the attention mechanism—the core process that brings them to life. Stay tuned, and feel free to share your questions or comments below!