Behind the Prompt: The Mechanics of LLM Inference

01

The First Mystery

Where Does Understanding Happen?

When people first learn about transformers, they quickly encounter two terms:

Attention
Feed-Forward Network, or FFN

At first glance, it is tempting to think that attention is where all the intelligence resides. After all, attention determines which tokens are related to one another.

Consider the sentence:

The animal did not cross the road because it was tired.

How does the model know that it refers to the animal and not the road?

This is where attention shines. Attention allows tokens to discover relationships with other tokens. It helps the model determine which parts of the prompt are relevant to the current token being processed.
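To make this concrete, here is a minimal sketch of scaled dot-product attention for a single query. The two-dimensional vectors are hypothetical and hand-picked so that the query for "it" resembles the key for "animal"; a real model learns these vectors and uses far more dimensions.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    mixed = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, mixed

# Hypothetical vectors, chosen by hand for illustration only.
tokens = ["animal", "road", "it"]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query_it = [4.0, 0.0]

weights, mixed = attend(query_it, keys, values)
for token, weight in zip(tokens, weights):
    print(f"{token:>6}: {weight:.2f}")
```

With these made-up numbers, "animal" receives the largest weight and "road" the smallest, which is the behavior the sentence above calls for.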

But this creates another question.

If attention merely identifies relevant information, where does the actual thinking happen?

02

If Attention Retrieves Information

What Does the FFN Do?

Return to the mental model from the previous section. Once the student has found the relevant pages in a textbook, the real work begins. The student reads, interprets, compares ideas, connects concepts, and develops understanding.

This is where the Feed-Forward Network enters the picture.

In an oversimplified but surprisingly useful mental model:

Attention

Retrieves

Attention retrieves contextual information and identifies relationships between different pieces of information.

FFN

Computes

The FFN takes the contextual information gathered by attention and transforms it into richer internal representations.

Without attention, the model would not know where to look. And without the FFN, the model would know where to look but would struggle to make meaningful use of the information.

Of course, reality is more nuanced than this simple description. Reasoning does not magically happen inside a single FFN layer. Instead, it emerges from repeated cycles across many transformer layers:

Attention -> FFN -> Attention -> FFN -> Attention -> FFN

Each layer refines the representation a little further. Individually, the layers appear simple. Collectively, they produce behavior that looks remarkably like reasoning. Understanding this is important because it leads us to the next question.
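The repeated Attention -> FFN cycle can be sketched in a few lines. This is a toy under stated assumptions, not a real transformer: attention is replaced by a uniform causal average, layer normalization is omitted, and the same hypothetical weights (w_in, w_out) are reused in every layer, whereas a real model learns separate weights per layer.

```python
def mix_tokens(states):
    """Stand-in for attention: each token receives the average of the
    tokens up to and including itself (causal, uniform weights)."""
    dim = len(states[0])
    mixed = []
    for pos in range(len(states)):
        visible = states[: pos + 1]
        mixed.append([sum(v[i] for v in visible) / len(visible) for i in range(dim)])
    return mixed

def feed_forward(vector, w_in, w_out):
    """Two-layer FFN applied to one token: expand, ReLU, project back."""
    hidden = [max(0.0, sum(x * w for x, w in zip(vector, row))) for row in w_in]
    return [sum(h * w for h, w in zip(hidden, row)) for row in w_out]

def transformer_block(states, w_in, w_out):
    # Residual connections: each sub-layer adds a refinement to its input.
    attended = mix_tokens(states)
    states = [[s + a for s, a in zip(tok, att)] for tok, att in zip(states, attended)]
    computed = [feed_forward(tok, w_in, w_out) for tok in states]
    return [[s + c for s, c in zip(tok, ffn)] for tok, ffn in zip(states, computed)]

def run_layers(states, w_in, w_out, num_layers=3):
    for _ in range(num_layers):
        states = transformer_block(states, w_in, w_out)
    return states

# Hypothetical weights: 2 input dimensions, 4 hidden units.
w_in = [[0.5, 0.0], [0.0, 0.5], [-0.5, 0.0], [0.0, -0.5]]
w_out = [[0.1, 0.0, -0.1, 0.0], [0.0, 0.1, 0.0, -0.1]]

states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in run_layers(states, w_in, w_out):
    print([round(x, 3) for x in row])
```

Note the division of labor: mix_tokens is the only step where tokens exchange information, while feed_forward processes each token independently.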

If the model is constantly processing information internally, what exactly happens after we submit a prompt?

03

Before The First Token

What Happens Before the First Token Appears?

Most people imagine that a Large Language Model starts generating an answer immediately after receiving a prompt.

Surprisingly, it doesn’t.

Before a single word appears on the screen, the model has already performed a significant amount of work. It has read the entire prompt, established relationships between tokens, processed those relationships through multiple transformer layers, and built the internal representations required for generation.

This hidden stage is known as Prefill.

You can think of it as the model’s reading phase.

Imagine a student sitting in an examination hall. Before writing an answer, the student first reads the entire question, understands what is being asked, identifies relevant concepts, and formulates an approach.

That is exactly what prefill does. At the end of prefill, the model understands the prompt. Yet nothing has appeared on the screen. The model has completed its internal preparation, but it has not started writing the answer.

04

Autoregressive Generation

Why Doesn't the Model Generate the Entire Answer at Once?

If the model already understands the prompt, why not simply produce the entire answer immediately?

The answer lies in how autoregressive language models work. LLMs generate text one token at a time. After prefill, the model predicts the most likely next token. Then it predicts the token after that. And then the next one.

This stage is called Decode.
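The loop below is a minimal sketch of autoregressive decoding. The lookup table NEXT_TOKEN_SCORES is a hypothetical stand-in for a real model's forward pass, and the selection rule is plain greedy choice; real systems often sample from the predicted distribution instead.

```python
# Hypothetical next-token table standing in for a real model.
NEXT_TOKEN_SCORES = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.9, "<eos>": 0.1},
    "dog": {"sat": 0.7, "<eos>": 0.3},
    "sat": {"<eos>": 1.0},
}

def predict_next(tokens):
    """Greedy choice of the highest-scoring next token.
    A real model conditions on the whole sequence; this toy
    looks only at the last token."""
    scores = NEXT_TOKEN_SCORES.get(tokens[-1], {"<eos>": 1.0})
    return max(scores, key=scores.get)

def generate(prompt_tokens, max_new_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = predict_next(tokens)
        if next_token == "<eos>":
            break
        # The new token becomes part of the input for the next step.
        tokens.append(next_token)
    return tokens

print(generate(["the"]))  # ['the', 'cat', 'sat']
```

The key property is the feedback loop: every generated token is appended to the sequence before the next prediction is made, which is why the answer cannot be produced in a single step.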

Prefill: Understanding
Decode: Generation

If prefill is equivalent to reading the question paper, decode is equivalent to writing the answer.

The distinction is subtle but important.

Prefill focuses on understanding. Decode focuses on generation.

And that leads to another interesting question.

If the model generates tokens one at a time, does it have to re-read the entire prompt every time a new token is generated?

Fortunately, the answer is no.

05

The Hidden Hero

KV Cache

Without optimization, token generation would be painfully inefficient.

Imagine generating token number 1001 after processing a 1000-token prompt.

Without some form of caching, the model would need to recompute the Keys and Values for every previous token each time a new token was generated. That would be extraordinarily expensive.

To avoid this, inference engines build a KV cache during prefill. The cache stores previously computed Keys and Values, allowing future tokens to reuse earlier computations. As a result, the model no longer needs to repeatedly rebuild the entire attention history. This dramatically improves performance.
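A small counting experiment shows the scale of the saving. CountingProjector is a hypothetical stand-in for the Key and Value projections of one attention layer; it performs no real math and only counts how many token projections are computed.

```python
class CountingProjector:
    """Counts Key/Value projections instead of computing them."""
    def __init__(self):
        self.projections = 0

    def project(self, token):
        self.projections += 1
        return (f"K({token})", f"V({token})")

def decode_without_cache(prompt, new_tokens, projector):
    sequence = list(prompt)
    kv = []
    for token in new_tokens:
        sequence.append(token)
        # Rebuild Keys and Values for the whole sequence at every step.
        kv = [projector.project(t) for t in sequence]
    return kv

def decode_with_cache(prompt, new_tokens, projector):
    # Prefill: compute Keys and Values for the prompt once.
    kv_cache = [projector.project(t) for t in prompt]
    for token in new_tokens:
        # Decode: only the new token needs a fresh projection.
        kv_cache.append(projector.project(token))
    return kv_cache

prompt = [f"p{i}" for i in range(1000)]
new_tokens = [f"g{i}" for i in range(10)]

naive, cached = CountingProjector(), CountingProjector()
kv_naive = decode_without_cache(prompt, new_tokens, naive)
kv_cached = decode_with_cache(prompt, new_tokens, cached)

print(naive.projections)   # 10055
print(cached.projections)  # 1010
```

Both paths end with the same Keys and Values. The cached path simply avoids recomputing them, and the gap widens with every additional generated token.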

However, many people stop here and draw the wrong conclusion.

The wrong conclusion

They assume that KV cache solves the inference problem entirely. It doesn't. It solves one problem, but another still remains.

06

KV Cache Pressure

If KV Cache Is So Useful, Why Not Store It Forever?

At this point, KV cache sounds like a perfect solution. The model computes Keys and Values once during prefill and then simply reuses them during decode. However, every optimization introduces a new challenge.

As context windows grow larger, KV cache size grows as well.

For short conversations this is rarely a concern. However, in long-running conversations, RAG workflows, and agentic systems that continuously accumulate context, the KV cache can consume a significant amount of GPU memory. In some scenarios, the memory required for the KV cache may rival or even exceed the memory consumed by the model weights themselves.
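A back-of-the-envelope estimate shows how quickly the cache grows. The configuration below (80 layers, 8 Key/Value heads, head dimension 128, 2 bytes per stored value) is a hypothetical example chosen for illustration and does not describe any specific model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to hold Keys and Values for every layer and cached token."""
    keys_and_values = 2  # one Key tensor and one Value tensor per layer
    return (keys_and_values * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

# Hypothetical configuration, for illustration only.
config = dict(num_layers=80, num_kv_heads=8, head_dim=128)

per_token = kv_cache_bytes(seq_len=1, **config)
long_context = kv_cache_bytes(seq_len=128_000, batch_size=4, **config)

print(per_token)                         # 327680
print(f"{long_context / 1e9:.1f} GB")    # 167.8 GB
```

Under these assumptions, four concurrent 128,000-token contexts need more memory for the cache than 140 GB of FP16 weights would occupy, which is the situation described above.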

Naturally, engineers began asking another question:

Can we compress the KV cache without significantly affecting model quality?

This led to a growing area of research focused on KV-cache quantization and compression techniques.

One example is TurboQuant, which attempts to reduce the memory footprint of the KV cache while preserving model quality. The goal is not merely to save memory, but also to improve inference efficiency by reducing memory traffic and increasing the effective context that can fit within GPU memory.

However, KV-cache optimization is not a one-size-fits-all problem.

Different model architectures respond differently to cache compression techniques. Some approaches significantly improve decode performance but introduce overhead during prefill. Others provide modest gains across both phases. In practice, selecting the right optimization often depends on the model architecture, workload characteristics, context length, and latency requirements.

As with most engineering problems, there is no universal winner. The best solution depends on what is being optimized.

Not All KV Cache Data Is Equally Sensitive

An interesting observation from recent research is that Keys and Values behave differently under quantization.

At first glance, it may seem reasonable to compress both equally. In reality, they play different roles inside the attention mechanism.

Keys are heavily involved in determining attention scores. Small errors in the Key tensors can alter which tokens receive attention and by how much, potentially affecting model accuracy.

Values, on the other hand, are used after the attention weights have already been determined. As a result, Value tensors often tolerate more aggressive compression.

A useful way to think about this is:

Keys help determine where to look.
Values help determine what information is retrieved.

Because of this distinction, many KV-cache optimization techniques prioritize preserving Key accuracy while applying more aggressive compression strategies to Values.
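The asymmetric treatment can be illustrated with a simple round trip through symmetric uniform quantization. This is a generic sketch, not the method used by TurboQuant or any other named technique, and the bit widths (8 for Keys, 4 for Values) and sample numbers are assumptions.

```python
def quantize(values, bits):
    """Symmetric uniform quantization followed by dequantization.
    Returns the values as they look after a round trip through `bits` bits."""
    levels = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    peak = max(abs(v) for v in values)
    scale = peak / levels if peak else 1.0
    return [round(v / scale) * scale for v in values]

def max_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))

# Hypothetical rows taken from a Key tensor and a Value tensor.
key_row = [0.82, -0.41, 0.07, -0.95, 0.33, 0.58]
value_row = [0.12, -0.77, 0.64, 0.05, -0.29, 0.91]

keys_8bit = quantize(key_row, bits=8)      # gentler compression for Keys
values_4bit = quantize(value_row, bits=4)  # more aggressive for Values

key_error = max_error(key_row, keys_8bit)
value_error = max_error(value_row, values_4bit)
print(f"Key error (8-bit):   {key_error:.4f}")
print(f"Value error (4-bit): {value_error:.4f}")
```

The Values end up at half the storage cost of the Keys, at the price of a larger reconstruction error. Whether that error is acceptable has to be measured on the actual model and workload.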

Another practical observation is that Value tensors often contain structural patterns that allow memory savings while maintaining the tensor shapes expected by the model.

Maintaining these shapes is important because the attention mechanism relies on predictable tensor dimensions for efficient execution.

This subtle difference between Keys and Values is one reason KV-cache optimization remains an active area of research rather than a solved problem.

07

Memory Bandwidth

If KV Cache Exists, Why Do AI Engineers Obsess Over Memory Bandwidth?

At this point, a natural question arises.

If the model already understands the prompt and can reuse previous attention computations through the KV cache, why does inference still require such powerful hardware?

Why are AI engineers constantly talking about HBM, GPU memory bandwidth, and specialized accelerators?

The answer lies in a distinction that is easy to overlook. KV cache reduces attention recomputation. It does not eliminate the need to read model weights. Every layer still requires access to its parameters. And modern LLMs contain an enormous number of parameters.

To understand why this matters, we need to look inside the GPU.

08

Inside The GPU

Understanding GPU Memory Bandwidth

When people discuss memory bandwidth in AI systems, they are usually referring to bandwidth inside the GPU, not PCIe bandwidth between the CPU and GPU.

A simplified view looks like this:

HBM -> L2 Cache -> Registers -> Tensor Cores

The tensor cores are the computational engines of the GPU.

They continuously request:

Model weights
Activations
KV cache data

Think of tensor cores as workers on a factory floor. The workers can process materials extremely quickly, but only if the materials arrive on time. If the supply chain cannot keep up, workers sit idle.

Similarly, if data cannot be delivered from memory fast enough, tensor cores remain underutilized. This is why memory bandwidth has become such an important design parameter in modern AI hardware.

But where does this bandwidth actually go?

A simple calculation helps answer that question.

09

A Rough 70B Model Calculation

Consider a 70-billion-parameter model.

Assuming FP16 precision, each parameter occupies approximately 2 bytes.

That means:

70 billion x 2 bytes ~= 140 GB

of model weights.

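Now suppose the model generates roughly 20 tokens per second.

During decode, producing each token requires reading essentially all of the model's weights from memory once (assuming a dense model serving a single request). A rough estimate therefore suggests:

140 GB per token x 20 tokens per second ~= 2.8 TB/s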

This is not a precise calculation, but it provides useful intuition.
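The same arithmetic can be written as a small helper, which makes it easy to try other numbers. It assumes that every decoded token streams all weights from memory exactly once and ignores KV cache reads, activations, and batching; the 50 tokens-per-second case is an added hypothetical.

```python
def weight_bytes(num_params, bytes_per_param):
    return num_params * bytes_per_param

def required_bandwidth(num_params, bytes_per_param, tokens_per_second):
    """Bytes per second needed if each decoded token reads all weights once."""
    return weight_bytes(num_params, bytes_per_param) * tokens_per_second

params = 70_000_000_000  # 70-billion-parameter model
fp16_bytes = 2           # bytes per FP16 parameter

at_20 = required_bandwidth(params, fp16_bytes, tokens_per_second=20)
at_50 = required_bandwidth(params, fp16_bytes, tokens_per_second=50)

print(f"{weight_bytes(params, fp16_bytes) / 1e9:.0f} GB of weights")  # 140 GB of weights
print(f"{at_20 / 1e12:.1f} TB/s at 20 tokens/s")                      # 2.8 TB/s at 20 tokens/s
print(f"{at_50 / 1e12:.1f} TB/s at 50 tokens/s")                      # 7.0 TB/s at 50 tokens/s
```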

Suddenly, the enormous bandwidth numbers associated with modern AI GPUs begin to make sense.

And this observation becomes even more important when we move beyond traditional chatbots.

10

Agentic AI

Why Agentic AI Changes the Equation

A traditional chatbot usually follows a simple pattern:

Traditional chatbot

Prompt -> Prefill -> Decode -> Response

Agentic systems are different.

An agent may:

Search documents
Call tools
Retrieve knowledge
Consult memory
Build new prompts
Re-enter the inference cycle repeatedly

Each cycle introduces additional prompt processing.

As a result, modern agentic workloads often spend a significant amount of time reading and processing context. This makes prompt processing efficiency, KV-cache management, and memory bandwidth even more important.

And naturally, this leads to the next question.

11

KV Cache Reuse

If We Already Computed the KV Cache, Why Compute It Again?

At this point, cache reuse sounds straightforward. If the prompt remains unchanged, reuse the KV cache and avoid repeating prefill. However, agentic systems introduce a complication.

An agent rarely performs a single inference pass. Instead, it continuously alternates between reasoning and action.

A typical workflow may look like:

User Request
Reasoning
Tool Call
Tool Result
Reasoning
Another Tool Call
Updated Context

After every tool call, new information is added to the context window.

At first glance, this appears to invalidate the entire KV cache.

If the prompt has changed, shouldn't the model be forced to rebuild everything from scratch?

Fortunately, modern inference systems can often do better. Rather than treating the KV cache as a single monolithic object, it can be divided into reusable segments or chunks.

Consider a simplified example:

Chunk A: System Prompt

Chunk B: User Request

Chunk C: Previous Reasoning

Chunk D: Tool Output

Chunk E: Current Reasoning

Suppose the agent performs another tool call and receives new information.

Only Chunk D changes.

Because each token's Keys and Values depend only on the tokens that precede it, the cache entries for Chunks A, B, and C remain valid. The system keeps them and recomputes only Chunk D and whatever follows it.

Conceptually, it behaves much like incremental compilation in software development. When a single source file changes, the compiler does not rebuild the entire project. It rebuilds only the components affected by the modification.

KV-cache reuse applies a similar idea to inference.

Instead of repeatedly processing thousands of unchanged tokens, the inference engine reuses previously computed cache entries and rebuilds only the portions that are no longer valid.
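A minimal sketch of this idea, assuming the simplest reuse policy: keep the cache for the longest unchanged prefix and recompute everything after the first chunk that differs. The chunk texts are hypothetical, and whitespace splitting stands in for a real tokenizer.

```python
def reusable_prefix(cached_chunks, new_chunks):
    """Count leading chunks that are identical in both contexts.
    With causal attention, a token's Keys and Values depend on everything
    before it, so reuse stops at the first chunk that differs."""
    count = 0
    for old, new in zip(cached_chunks, new_chunks):
        if old != new:
            break
        count += 1
    return count

def tokens_to_recompute(cached_chunks, new_chunks):
    kept = reusable_prefix(cached_chunks, new_chunks)
    # Whitespace split is a stand-in for a real tokenizer.
    return sum(len(text.split()) for _, text in new_chunks[kept:])

# Hypothetical contexts before and after a tool call.
previous = [
    ("system", "you are a helpful agent"),
    ("user", "find the cheapest flight"),
    ("reasoning", "i should call the search tool"),
    ("tool", "no results yet"),
]
current = previous[:3] + [
    ("tool", "three flights found today"),
    ("reasoning", "compare the prices"),
]

print(reusable_prefix(previous, current))      # 3
print(tokens_to_recompute(previous, current))  # 7
```

In this toy example, 15 of the 22 tokens in the new context are served from the cache, and only the 7 tokens after the change go through prefill again.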

This significantly reduces prompt-processing overhead, particularly in agentic workflows where large portions of the context remain unchanged across multiple reasoning steps.

As agents become more sophisticated, cache reuse evolves from a useful optimization into a fundamental requirement. Without it, the same context may be processed dozens of times during a single task, consuming GPU resources without contributing any new information.

In many real-world deployments, the fastest token is not the one generated more quickly. It is the token whose computation was never repeated in the first place.

12

Mixture of Experts

Enter Mixture of Experts

If bandwidth is becoming the bottleneck, why activate the entire model for every token?

This question led researchers toward Mixture of Experts architectures.

In a traditional dense model, every token activates the same FFN layers. Whether the token is simple or complex, the entire model participates.

MoE takes a different approach. Instead of activating every expert, a routing mechanism selects only a small subset of experts for each token. This reduces the number of active parameters.
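The routing step can be sketched as a top-k selection over router scores. Everything here is illustrative: the experts are one-line functions rather than full FFN blocks, the router scores are made up, and k=2 is an assumed setting.

```python
import math

def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and normalize their weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    peak = max(router_logits[i] for i in chosen)
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: exps[i] / total for i in chosen}

def moe_ffn(token_value, router_logits, experts, k=2):
    """Run only the selected experts and blend their outputs."""
    gates = top_k_route(router_logits, k)
    return sum(weight * experts[i](token_value) for i, weight in gates.items())

# Hypothetical experts and router scores, for illustration only.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
router_logits = [0.1, 2.0, -1.0, 2.0]

gates = top_k_route(router_logits, k=2)
print(sorted(gates))                          # [1, 3]
print(moe_ffn(3.0, router_logits, experts))   # 7.5
```

Experts 0 and 2 are never called for this token, so their weights are never read. That skipped work is the source of the savings.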

But notice something interesting.

When discussing MoE, people usually talk about expert FFN layers, not expert attention layers.

Why?

Because most transformer parameters reside inside FFN layers.

Attention is essential for communication between tokens and typically remains dense.

FFN layers, on the other hand, account for the majority of parameters and are therefore the most attractive target for selective activation.

As a result:

Total parameters can become very large.
Active parameters remain relatively small.
Memory traffic is reduced.
Inference becomes more efficient.
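The gap between total and active parameters is simple arithmetic. The figures below are hypothetical round numbers, with attention and other always-on weights grouped as shared_params and expert sizes summed across all layers.

```python
def moe_parameter_counts(shared_params, params_per_expert,
                         num_experts, experts_per_token):
    """Return (total, active) parameter counts for a simplified MoE model."""
    total = shared_params + params_per_expert * num_experts
    active = shared_params + params_per_expert * experts_per_token
    return total, active

# Hypothetical model: 2B shared parameters, 64 experts of 1B each, 2 active per token.
total, active = moe_parameter_counts(
    shared_params=2_000_000_000,
    params_per_expert=1_000_000_000,
    num_experts=64,
    experts_per_token=2,
)

print(f"total:  {total / 1e9:.0f}B")   # total:  66B
print(f"active: {active / 1e9:.0f}B")  # active: 4B
```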

13

Expert Memory

If Only a Few Experts Are Active, Why Keep All Experts in GPU Memory?

At first glance, Mixture of Experts appears to solve the inference problem elegantly.

For any given token, only a small subset of experts is activated. If only a few experts are needed, one might assume that memory requirements would shrink proportionally.

Unfortunately, the reality is more complicated.

Although only a small number of experts are active for each token, the system must still keep the entire expert pool available. In large MoE models, the combined size of all experts can become enormous, often exceeding the memory capacity of a single GPU.

One common solution is expert offloading.

Frequently used experts remain in GPU memory, while less frequently used experts are stored in system memory, or RAM, and loaded when required.

While this approach reduces GPU memory requirements, it introduces a new bottleneck.

The moment an expert must be fetched from host memory, data must travel across PCIe. Compared to GPU HBM bandwidth, PCIe is significantly slower.

A modern AI GPU may provide several terabytes per second of HBM bandwidth, while PCIe bandwidth is typically measured in tens of gigabytes per second.

As a result, an expert that resides in RAM may introduce noticeable latency whenever it is activated.
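A rough transfer-time estimate makes the gap tangible. The expert size (0.5 billion FP16 parameters) and the two bandwidth figures (3 TB/s for HBM, 50 GB/s for PCIe) are assumed round numbers consistent with the ranges above, not measurements of any particular hardware.

```python
def transfer_ms(num_bytes, bandwidth_bytes_per_s):
    """Time in milliseconds to move num_bytes at the given bandwidth."""
    return num_bytes / bandwidth_bytes_per_s * 1000

expert_bytes = 500_000_000 * 2  # hypothetical 0.5B-parameter expert in FP16

hbm_ms = transfer_ms(expert_bytes, 3e12)    # assumed 3 TB/s HBM
pcie_ms = transfer_ms(expert_bytes, 50e9)   # assumed 50 GB/s PCIe

print(f"HBM:  {hbm_ms:.2f} ms")    # HBM:  0.33 ms
print(f"PCIe: {pcie_ms:.2f} ms")   # PCIe: 20.00 ms
```

At 20 tokens per second, each token has a budget of 50 ms. Under these assumptions, a single expert fetched over PCIe would consume a large share of that budget on its own.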

This naturally raises another question:

If certain experts are rarely used, do we really need to keep them at all?

This idea motivates techniques such as REAP.

REAP: Keeping the Experts That Matter

REAP focuses on identifying experts that contribute little to real-world inference workloads and removing them from the model.

The intuition is straightforward.

In many deployments, expert utilization is highly uneven. Some experts are selected frequently, while others may be activated only rarely.

If a subset of experts contributes minimally to model behavior, retaining them consumes memory without providing proportional value.

By pruning underutilized experts, REAP reduces the overall size of the MoE model.
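The general shape of utilization-based pruning can be sketched as follows. REAP's actual scoring criterion is not described here, so this sketch substitutes plain selection frequency, and the routing log is hypothetical data.

```python
from collections import Counter

def expert_usage(routing_log, num_experts):
    """Fraction of routing decisions that selected each expert."""
    counts = Counter(e for chosen in routing_log for e in chosen)
    total = sum(counts.values())
    return [counts[e] / total for e in range(num_experts)]

def prune_experts(usage, keep):
    """Return the indices of the `keep` most frequently used experts."""
    ranked = sorted(range(len(usage)), key=lambda e: usage[e], reverse=True)
    return sorted(ranked[:keep])

# Hypothetical log: the pair of experts chosen for each of six tokens.
routing_log = [(0, 2), (0, 1), (2, 0), (0, 2), (3, 0), (2, 1)]

usage = expert_usage(routing_log, num_experts=4)
kept = prune_experts(usage, keep=2)

print([round(u, 2) for u in usage])  # [0.42, 0.17, 0.33, 0.08]
print(kept)                          # [0, 2]
```

In a real system the router would also need to be restricted to the surviving experts, and the pruned model would be re-evaluated to confirm that quality holds on the target workload.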

This creates two important benefits.

First, more of the model can fit directly inside GPU memory, reducing dependence on PCIe transfers and expert offloading.

Second, the reduced memory footprint allows the remaining experts to be served more efficiently, improving overall inference performance.

From an infrastructure perspective, the goal is not simply to make the model smaller.

The goal is to maximize the amount of useful model capacity that can remain inside high-bandwidth GPU memory, where it can be accessed at HBM speeds rather than PCIe speeds.

In many ways, REAP extends the same philosophy we encountered earlier with KV-cache optimization and cache reuse:

the fastest data movement is the data movement that never needs to happen.

14

Bringing It All Together

What started as a simple question - “What happens before a Large Language Model answers?” - eventually led us through the most important concepts in modern LLM inference.

We discovered that attention helps tokens find relevant information, while FFN layers transform that information into richer representations.

We saw that reasoning emerges from many rounds of interaction between these components rather than from a single layer.

We learned that before a model generates its first token, it spends time understanding the prompt during prefill.

We explored why decode works differently, why KV cache exists, and why memory bandwidth has become such a critical resource.

We saw how agentic AI amplifies these challenges and why architectures such as Mixture of Experts have become increasingly important.

Most importantly, we learned that inference is not simply about generating text.

Long before the first token appears on the screen, the model has already performed a remarkable amount of work.

The answer begins much earlier than most people realize.