How LLMs Actually Work: From Tokens to Transformers
You use ChatGPT, Gemini, or Claude every day. You type something. It responds. Feels like magic. It is not magic. Under the hood, a predictable set of steps runs every single time. This post breaks down each one.
What Is an LLM?
LLM stands for Large Language Model. It is an AI system trained to understand and generate human language.
The key phrase is "generate." When you search on Google, it finds existing content. When you talk to an LLM, it creates a response on the spot, word by word, based on patterns learned from training data.
OpenAI trained GPT on a massive slice of the internet. Your tweets, LinkedIn posts, articles, books. All of it became training data. The model learned patterns from that data. Now it uses those patterns to generate responses.
What Does GPT Actually Stand For?
GPT means Generative Pre-Trained Transformer. Each word matters.
- Generative: It creates output. It does not search or retrieve. It generates.
- Pre-Trained: It learned from a large dataset before you ever talked to it. That training is why it knows things.
- Transformer: This is the actual architecture running underneath. More on this below.
Every major LLM today, from Gemini to Claude to Mistral, is a transformer. OpenAI named its product GPT because that is literally what it is.
How a Transformer Works
In 2017, Google published a paper titled Attention Is All You Need. That paper introduced the transformer architecture. Every LLM you use today is built on it.
The core idea is simple: predict the next token.
That is the entire job of a transformer. Give it an input sequence, and it tells you what comes next.
Here is the loop:
- You type: "Hey there"
- The transformer predicts the next token: "I"
- You feed "Hey there I" back into the transformer
- It predicts: "am"
- Repeat until the model outputs an end token
That is it. One token at a time. This is why LLMs need GPU power. For a short reply, the transformer runs dozens of times. For a long response, it runs hundreds.
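The loop above can be sketched in a few lines of Python. The "model" here is a hypothetical lookup table standing in for a real transformer; the feedback loop around it is the part that matters.

```python
def toy_model(sequence):
    # A stand-in for the transformer: predicts the next token from the
    # last token seen. (Made-up bigram table, for illustration only.)
    table = {"Hey": "there", "there": "I", "I": "am", "am": "<end>"}
    return table.get(sequence[-1], "<end>")

tokens = ["Hey", "there"]
while True:
    next_token = toy_model(tokens)
    if next_token == "<end>":   # stop when the model outputs an end token
        break
    tokens.append(next_token)   # feed the output back in as input

print(" ".join(tokens))  # Hey there I am
```

A real model runs a full forward pass through billions of parameters on each iteration of that `while` loop, which is why long responses are expensive.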
What Is a Token?
Tokens are how LLMs read your input. Your text does not go into the model as letters. It gets converted into numbers first.
A token is a chunk of text mapped to a number. Think of it like a lookup table.
- "hey" → 225216
- "there" → 3274
- "lohit" → broken into multiple tokens
Every model has its own tokenization system. GPT-4o tokenizes differently than Gemini. Same input, different numbers.
The process:
- You type text
- Tokenization converts it to a list of numbers
- Those numbers go into the transformer
- The transformer predicts the next number
- De-tokenization converts numbers back to text
- You see the response
Build Your Own Tokenizer in Python
OpenAI ships a library called tiktoken that lets you tokenize and de-tokenize text for its GPT models.
import tiktoken
# Load the encoder for GPT-4o
encoder = tiktoken.encoding_for_model("gpt-4o")
text = "Hey there, my name is lohit"
# Tokenize
tokens = encoder.encode(text)
print(tokens)
# Output: a list of numbers
# De-tokenize
decoded = encoder.decode(tokens)
print(decoded)
# Output: "Hey there, my name is lohit"
Setup steps:
python -m venv venv
source venv/bin/activate
pip install tiktoken
pip freeze > requirements.txt
This is the full tokenization cycle in code. Text in, numbers in, text out.
Vector Embeddings: Giving Words Meaning
Tokens are numbers. But numbers alone carry no meaning. The number 56 does not tell the model anything about what "dog" means.
Vector embeddings solve this. They assign coordinates to tokens in a multi-dimensional space, placing related words near each other.
Think of a 2D graph:
- "dog" and "cat" plot close together (both animals)
- "Paris" and "India" plot close together (both places)
- "Eiffel Tower" and "India Gate" plot close together (both landmarks)
The distance and direction between points capture real-world relationships. If you move from "Paris" to "India" in embedding space, the same movement applied to "Eiffel Tower" lands you near "India Gate." Same relationship, different entities.
This is how the model understands meaning without understanding language the way humans do. It uses geometry.
In practice, embeddings are not 2D. They are thousands of dimensions. But the principle is the same.
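"Near each other" has a precise meaning: the angle between vectors. A common measure is cosine similarity, sketched here with made-up 2D coordinates (real embeddings have thousands of dimensions and learned values):

```python
import math

# Toy 2D embeddings. The coordinates are invented for illustration.
embeddings = {
    "dog":          (0.9, 0.1),
    "cat":          (0.8, 0.2),
    "Paris":        (0.1, 0.9),
    "Eiffel Tower": (0.2, 0.8),
}

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 = same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Related words score higher than unrelated ones.
print(cosine_similarity(embeddings["dog"], embeddings["cat"]))
print(cosine_similarity(embeddings["dog"], embeddings["Paris"]))
```

Run it and "dog"/"cat" come out far more similar than "dog"/"Paris", which is exactly the geometric notion of meaning described above.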
Positional Encoding: Order Matters
Here is a problem with vector embeddings alone.
- "Dog ate cat" → tokens: dog, ate, cat
- "Cat ate dog" → tokens: cat, ate, dog
The vector embeddings for both sentences contain the same three words. They look identical to the model. But the meaning is completely different.
Positional encoding fixes this. Before the tokens enter the transformer, each one gets stamped with its position in the sentence.
- dog at position 0
- ate at position 1
- cat at position 2
Now the model knows the order. "Dog ate cat" and "cat ate dog" produce different representations, and the model treats them differently.
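One concrete way to stamp positions, taken from the original transformer paper, is sinusoidal positional encoding: each position gets a unique vector built from sine and cosine waves, which is then added to the token's embedding. A minimal sketch (the dimension `d_model=8` is an arbitrary choice here):

```python
import math

def positional_encoding(position, d_model=8):
    # Sinusoidal encoding from "Attention Is All You Need":
    # even dimensions use sin, odd dimensions use cos,
    # at geometrically spaced frequencies.
    pe = []
    for i in range(d_model):
        angle = position / (10000 ** (2 * (i // 2) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

# Every position produces a distinct vector, so after adding these to the
# embeddings, "dog ate cat" and "cat ate dog" no longer look identical.
for pos, word in enumerate(["dog", "ate", "cat"]):
    print(word, [round(x, 3) for x in positional_encoding(pos)])
```

Many modern models use learned or rotary position encodings instead, but the job is the same: make position visible to the model.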
Self-Attention: Letting Words Talk to Each Other
After positional encoding, tokens go through the attention mechanism.
Self-attention lets each token look at every other token in the sentence and adjust its meaning based on context.
Example: the word "bank"
- "river bank" → bank means a riverbank
- "ICICI bank" → bank means a financial institution
Same word. Same position. Different meaning. Self-attention lets the word "river" influence how "bank" is represented. The vector for "bank" changes based on its neighbors.
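The mechanism behind this is scaled dot-product attention. The sketch below uses identity projections and toy 4-dimensional vectors for clarity; a real transformer computes Q, K, and V with learned weight matrices:

```python
import numpy as np

def self_attention(X):
    # Scaled dot-product self-attention, with Q = K = V = X for simplicity.
    # Real models apply learned projections W_q, W_k, W_v first.
    Q, K, V = X, X, X
    d_k = X.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # how strongly each token attends to each other token
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax rows
    return weights @ V                # each output is a context-weighted mix of the inputs

# Made-up embeddings for a two-token input, e.g. "river" and "bank".
X = np.array([[1.0, 0.0, 1.0, 0.0],    # "river"
              [0.0, 1.0, 1.0, 1.0]])   # "bank"
out = self_attention(X)
print(out.round(3))
```

After the pass, the second row is no longer the raw "bank" vector: it has mixed in information from "river", which is precisely how context shifts a word's meaning.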
Multi-Head Attention: Seeing Multiple Things at Once
Multi-head attention runs self-attention multiple times in parallel, each time focusing on a different aspect of the input.
When you see a dog sleeping in a passing train:
- One part of your brain notices it is a dog
- Another notices it is a Labrador
- Another notices it is near an open door
- Another tracks the speed of the train
You process all of this at once. Multi-head attention does the same. It builds a richer, more complete representation of the input by attending to multiple aspects simultaneously.
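Mechanically, multi-head attention splits the embedding dimension into slices, runs attention on each slice in parallel, and concatenates the results. A simplified sketch (identity projections again; real models use learned per-head matrices plus an output projection):

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention for one head.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ V

def multi_head_attention(X, num_heads=2):
    # Split the embedding dimension across heads, attend per head,
    # then concatenate the heads back together.
    head_dim = X.shape[-1] // num_heads
    heads = []
    for i in range(num_heads):
        slice_ = X[:, i * head_dim:(i + 1) * head_dim]
        heads.append(attention(slice_, slice_, slice_))
    return np.concatenate(heads, axis=-1)

# Two tokens with 4-dimensional toy embeddings, split across 2 heads.
X = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 1.0, 1.0]])
print(multi_head_attention(X).round(3))
```

Each head sees a different slice of the representation, so each can specialize in a different kind of relationship, mirroring the train-window analogy above.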
The Full Pipeline
Here is how a single response gets generated:
- Tokenization: Your text becomes a list of numbers
- Input Embeddings: Numbers get converted to vectors with semantic meaning
- Positional Encoding: Position data gets added to each vector
- Self-Attention: Tokens look at each other and update their meanings
- Multi-Head Attention: Multiple attention passes run in parallel for richer context
- Feed Forward Layer: A neural network processes the attention output
- Linear Layer: A score (logit) is produced for every token in the vocabulary
- Softmax: The logits become a probability distribution, and the next token is selected, either by taking the most probable one or by sampling
- De-tokenization: The output numbers convert back to text
- Repeat: The new token gets appended and the whole loop runs again
Every token in a response goes through all ten steps.
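The last two numeric steps are easy to see in code. Given hypothetical logits over a tiny four-token vocabulary, softmax turns them into probabilities and a token gets picked (greedy selection here; production models usually sample with a temperature):

```python
import math

def softmax(logits):
    # Convert raw scores from the linear layer into probabilities that sum to 1.
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a toy 4-token vocabulary (invented numbers).
vocab = ["am", "is", "was", "the"]
logits = [3.1, 1.2, 0.4, -0.5]

probs = softmax(logits)
next_token = vocab[probs.index(max(probs))]  # greedy: pick the most probable token
print(next_token)  # am
```

Swap the greedy pick for weighted random sampling and you get the variation you see when an LLM answers the same prompt differently twice.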
What This Means for You as a Developer
There is a clear line between ML researchers and application developers.
ML researchers build foundation models. They live in the math and the architecture. They write papers like Attention Is All You Need.
Application developers build products on top of those models. You do not need to implement a transformer to build with LLMs. You need to understand what they do, how they process input, and what their limits are.
Knowing the pipeline above gives you that understanding. You will write better prompts. You will design better systems. You will debug outputs more effectively.
The magic is not magic. It is next-token prediction, repeated.