How LLMs Actually Work: From Tokens to Transformers
You use ChatGPT, Gemini, or Claude every day. You type something. It responds. Feels like magic. It is not magic. Under the hood, a predictable set of steps runs every single time. This post breaks down each one.
What Is an LLM?
LLM stands for Large Language Model. It is an AI system trained to understand and generate human language.
The key phrase is "generate." When you search on Google, it finds existing content. When you talk to an LLM, it creates a response on the spot, word by word, based on patterns learned from training data.
OpenAI trained GPT on a massive slice of the internet. Your tweets, LinkedIn posts, articles, books. All of it became training data. The model learned patterns from that data. Now it uses those patterns to generate responses.
What Does GPT Actually Stand For?
GPT means Generative Pre-Trained Transformer. Each word matters.
- Generative: It creates output. It does not search or retrieve. It generates.
- Pre-Trained: It learned from a large dataset before you ever talked to it. That training is why it knows things.
- Transformer: This is the actual architecture running underneath. More on this below.
Every major LLM today, from Gemini to Claude to Mistral, is a transformer. OpenAI named its product GPT because that is literally what it is.
How a Transformer Works
In 2017, Google published a paper titled Attention Is All You Need. That paper introduced the transformer architecture. Every LLM you use today is built on it.
The core idea is simple: predict the next token.
That is the entire job of a transformer. Give it an input sequence, and it tells you what comes next.
Here is the loop:
- You type: "Hey there"
- The transformer predicts the next token: "I"
- You feed "Hey there I" back into the transformer
- It predicts: "am"
- Repeat until the model outputs an end token
That is it. One token at a time. This is why LLMs need GPU power. For a short reply, the transformer runs dozens of times. For a long response, it runs hundreds.
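The loop above can be sketched in a few lines of Python. The "model" here is a hypothetical lookup table standing in for a real transformer; the feedback loop around it is the part that matters.

```python
def toy_model(sequence):
    # A stand-in for the transformer: predicts the next token from the
    # last token seen. (Made-up bigram table, for illustration only.)
    table = {"Hey": "there", "there": "I", "I": "am", "am": "<end>"}
    return table.get(sequence[-1], "<end>")

tokens = ["Hey", "there"]
while True:
    next_token = toy_model(tokens)
    if next_token == "<end>":   # stop when the model outputs an end token
        break
    tokens.append(next_token)   # feed the output back in as input

print(" ".join(tokens))  # Hey there I am
```

A real model runs a full forward pass through billions of parameters on each iteration of that `while` loop, which is why long responses are expensive.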
What Is a Token?
Tokens are how LLMs read your input. Your text does not go into the model as letters. It gets converted into numbers first.
A token is a chunk of text mapped to a number. Think of it like a lookup table.
- "hey" → 225216
- "there" → 3274
- "lohit" → broken into multiple tokens
Every model has its own tokenization system. GPT-4o tokenizes differently than Gemini. Same input, different numbers.
The process:
- You type text
- Tokenization converts it to a list of numbers
- Those numbers go into the transformer
- The transformer predicts the next number
- De-tokenization converts numbers back to text
- You see the response
Build Your Own Tokenizer in Python
OpenAI ships a library called tiktoken that lets you tokenize and de-tokenize text for its GPT models.
import tiktoken
# Load the encoder for GPT-4o
encoder = tiktoken.encoding_for_model("gpt-4o")
text = "Hey there, my name is lohit"
# Tokenize
tokens = encoder.encode(text)
print(tokens)
# Output: a list of numbers
# De-tokenize
decoded = encoder.decode(tokens)
print(decoded)
# Output: "Hey there, my name is lohit"
Setup steps:
python -m venv venv
source venv/bin/activate
pip install tiktoken
pip freeze > requirements.txt
This is the full tokenization cycle in code. Text in, numbers in, text out.
Vector Embeddings: Giving Words Meaning
Tokens are numbers. But numbers alone carry no meaning. The number 56 does not tell the model anything about what "dog" means.
Vector embeddings solve this. They assign coordinates to tokens in a multi-dimensional space, placing related words near each other.
Think of a 2D graph:
- "dog" and "cat" plot close together (both animals)
- "Paris" and "India" plot close together (both places)
- "Eiffel Tower" and "India Gate" plot close together (both landmarks)
The distance and direction between points capture real-world relationships. If you move from "Paris" to "India" in embedding space, the same movement applied to "Eiffel Tower" lands you near "India Gate." Same relationship, different entities.
This is how the model understands meaning without understanding language the way humans do. It uses geometry.
In practice, embeddings are not 2D. They are thousands of dimensions. But the principle is the same.
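"Near each other" has a precise meaning: the angle between vectors. A common measure is cosine similarity, sketched here with made-up 2D coordinates (real embeddings have thousands of dimensions and learned values):

```python
import math

# Toy 2D embeddings. The coordinates are invented for illustration.
embeddings = {
    "dog":          (0.9, 0.1),
    "cat":          (0.8, 0.2),
    "Paris":        (0.1, 0.9),
    "Eiffel Tower": (0.2, 0.8),
}

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 = same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Related words score higher than unrelated ones.
print(cosine_similarity(embeddings["dog"], embeddings["cat"]))
print(cosine_similarity(embeddings["dog"], embeddings["Paris"]))
```

Run it and "dog"/"cat" come out far more similar than "dog"/"Paris", which is exactly the geometric notion of meaning described above.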
Positional Encoding: Order Matters
Here is a problem with vector embeddings alone.
- "Dog ate cat" → tokens: dog, ate, cat
- "Cat ate dog" → tokens: cat, ate, dog
The vector embeddings for both sentences contain the same three words. They look identical to the model. But the meaning is completely different.
Positional encoding fixes this. Before the tokens enter the transformer, each one gets stamped with its position in the sentence.
- dog at position 0
- ate at position 1
- cat at position 2
Now the model knows the order. "Dog ate cat" and "cat ate dog" produce different representations, and the model treats them differently.
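One concrete way to stamp positions, taken from the original transformer paper, is sinusoidal positional encoding: each position gets a unique vector built from sine and cosine waves, which is then added to the token's embedding. A minimal sketch (the dimension `d_model=8` is an arbitrary choice here):

```python
import math

def positional_encoding(position, d_model=8):
    # Sinusoidal encoding from "Attention Is All You Need":
    # even dimensions use sin, odd dimensions use cos,
    # at geometrically spaced frequencies.
    pe = []
    for i in range(d_model):
        angle = position / (10000 ** (2 * (i // 2) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

# Every position produces a distinct vector, so after adding these to the
# embeddings, "dog ate cat" and "cat ate dog" no longer look identical.
for pos, word in enumerate(["dog", "ate", "cat"]):
    print(word, [round(x, 3) for x in positional_encoding(pos)])
```

Many modern models use learned or rotary position encodings instead, but the job is the same: make position visible to the model.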
Self-Attention: Letting Words Talk to Each Other
After positional encoding, tokens go through the attention mechanism.
Self-attention lets each token look at every other token in the sentence and adjust its meaning based on context.
Example: the word "bank"
- "river bank" → bank means a riverbank
- "ICICI bank" → bank means a financial institution
Same word. Same position. Different meaning. Self-attention lets the word "river" influence how "bank" is represented. The vector for "bank" changes based on its neighbors.
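The mechanism behind this is scaled dot-product attention. The sketch below uses identity projections and toy 4-dimensional vectors for clarity; a real transformer computes Q, K, and V with learned weight matrices:

```python
import numpy as np

def self_attention(X):
    # Scaled dot-product self-attention, with Q = K = V = X for simplicity.
    # Real models apply learned projections W_q, W_k, W_v first.
    Q, K, V = X, X, X
    d_k = X.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # how strongly each token attends to each other token
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax rows
    return weights @ V                # each output is a context-weighted mix of the inputs

# Made-up embeddings for a two-token input, e.g. "river" and "bank".
X = np.array([[1.0, 0.0, 1.0, 0.0],    # "river"
              [0.0, 1.0, 1.0, 1.0]])   # "bank"
out = self_attention(X)
print(out.round(3))
```

After the pass, the second row is no longer the raw "bank" vector: it has mixed in information from "river", which is precisely how context shifts a word's meaning.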
Multi-Head Attention: Seeing Multiple Things at Once
Multi-head attention runs self-attention multiple times in parallel, each time focusing on a different aspect of the input.
When you see a dog sleeping in a passing train:
- One part of your brain notices it is a dog
- Another notices it is a Labrador
- Another notices it is near an open door
- Another tracks the speed of the train
You process all of this at once. Multi-head attention does the same. It builds a richer, more complete representation of the input by attending to multiple aspects simultaneously.
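Mechanically, multi-head attention splits the embedding dimension into slices, runs attention on each slice in parallel, and concatenates the results. A simplified sketch (identity projections again; real models use learned per-head matrices plus an output projection):

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention for one head.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ V

def multi_head_attention(X, num_heads=2):
    # Split the embedding dimension across heads, attend per head,
    # then concatenate the heads back together.
    head_dim = X.shape[-1] // num_heads
    heads = []
    for i in range(num_heads):
        slice_ = X[:, i * head_dim:(i + 1) * head_dim]
        heads.append(attention(slice_, slice_, slice_))
    return np.concatenate(heads, axis=-1)

# Two tokens with 4-dimensional toy embeddings, split across 2 heads.
X = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 1.0, 1.0]])
print(multi_head_attention(X).round(3))
```

Each head sees a different slice of the representation, so each can specialize in a different kind of relationship, mirroring the train-window analogy above.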
The Full Pipeline
Here is how a single response gets generated:
- Tokenization: Your text becomes a list of numbers
- Input Embeddings: Numbers get converted to vectors with semantic meaning
- Positional Encoding: Position data gets added to each vector
- Self-Attention: Tokens look at each other and update their meanings
- Multi-Head Attention: Multiple attention passes run in parallel for richer context
- Feed Forward Layer: A neural network processes the attention output
- Linear Layer: A score (logit) is produced for every token in the vocabulary
- Softmax: The logits become a probability distribution, and the next token is selected, either by taking the most probable one or by sampling
- De-tokenization: The output numbers convert back to text
- Repeat: The new token gets appended and the whole loop runs again
Every token in a response goes through all ten steps.
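The last two numeric steps are easy to see in code. Given hypothetical logits over a tiny four-token vocabulary, softmax turns them into probabilities and a token gets picked (greedy selection here; production models usually sample with a temperature):

```python
import math

def softmax(logits):
    # Convert raw scores from the linear layer into probabilities that sum to 1.
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a toy 4-token vocabulary (invented numbers).
vocab = ["am", "is", "was", "the"]
logits = [3.1, 1.2, 0.4, -0.5]

probs = softmax(logits)
next_token = vocab[probs.index(max(probs))]  # greedy: pick the most probable token
print(next_token)  # am
```

Swap the greedy pick for weighted random sampling and you get the variation you see when an LLM answers the same prompt differently twice.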
What This Means for You as a Developer
There is a clear line between ML researchers and application developers.
ML researchers build foundation models. They live in the math and the architecture. They write papers like Attention Is All You Need.
Application developers build products on top of those models. You do not need to implement a transformer to build with LLMs. You need to understand what they do, how they process input, and what their limits are.
Knowing the pipeline above gives you that understanding. You will write better prompts. You will design better systems. You will debug outputs more effectively.
The magic is not magic. It is next-token prediction, repeated.