Context Windows Explained: Why Memory Matters More Than Speed

Context window size is the silent killer of LLM applications. You can have the fastest GPU in the world, but if you can't fit your context in memory, speed is irrelevant. This article explains why memory capacity - not compute speed - is often the binding constraint for real-world LLM deployments.

What Is a Context Window?

The context window is the maximum amount of text (measured in tokens) that an LLM can "see" at once. When you send a prompt to an LLM, the context window includes the system prompt, the conversation history, any documents or retrieved content you attach, and the tokens the model generates in response.

Modern LLMs support increasingly large context windows: Claude supports 200K tokens, GPT-4 Turbo supports 128K, and Llama 3.x supports up to 128K. But supporting a context window and actually using it are different things.

The KV Cache: Where Memory Goes

Here's the technical detail that changes everything: transformer models don't just need memory for model weights. They need memory for the KV (Key-Value) cache - a data structure that grows linearly with context length.

How the KV Cache Works

During inference, each token attends to all previous tokens. To avoid recomputing attention for every token, models cache the Key and Value projections for each layer. This cache must be stored in GPU memory.

KV Cache Size = 2 × layers × kv_heads × head_dim × context_length × bytes_per_value

(The leading 2 accounts for storing both Keys and Values.) For Llama 70B (80 layers, grouped-query attention with 8 KV heads, head_dim 128) at FP16 with 128K context: 2 × 80 × 8 × 128 × 131,072 × 2 ≈ 43GB

That ~43GB is in addition to the ~140GB needed for model weights, so running Llama 70B with a 128K context requires roughly 180-200GB of GPU memory.
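The formula can be sketched as a small helper. The Llama 70B configuration used here (80 layers, 8 KV heads under grouped-query attention, head_dim 128) is an assumption based on the published architecture:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context_length, bytes_per_value=2):
    # The leading 2 accounts for storing both the Key and the Value projections.
    return 2 * layers * kv_heads * head_dim * context_length * bytes_per_value

# Llama 70B, FP16 cache values, 128K context (architecture values are assumptions):
size = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, context_length=131_072)
print(f"{size / 1e9:.1f} GB")  # → 42.9 GB
```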

The Memory Math

Total GPU memory needed = Model weights + KV cache + Activations + Overhead. For large models with long contexts, the KV cache can exceed the model weights themselves.
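That budget can be checked with a rough sketch. The flat overhead allowance below is an assumption; real activation and runtime overhead varies by inference stack:

```python
def fits_in_memory(weights_gb, kv_cache_gb, capacity_gb, overhead_gb=10.0):
    # Weights + KV cache + a flat allowance for activations and runtime overhead.
    return weights_gb + kv_cache_gb + overhead_gb <= capacity_gb

# Llama 70B (FP16 weights ~140 GB) with an approximate 128K-context KV cache:
print(fits_in_memory(140, 43, capacity_gb=80))   # single 80 GB H100 → False
print(fits_in_memory(140, 43, capacity_gb=512))  # 512 GB unified memory → True
```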

KV Cache Size by Model and Context

Model        Weights (FP16)   KV Cache (32K)   KV Cache (128K)
Llama 7B     14 GB            ~2 GB            ~8 GB
Llama 13B    26 GB            ~4 GB            ~16 GB
Llama 70B    140 GB           ~11 GB           ~43 GB
Llama 405B   810 GB           ~45 GB           ~180 GB

Why This Matters: Real-World Constraints

The H100 Ceiling

An NVIDIA H100 has 80GB of memory. Let's see what that actually allows:

Model       Precision   Max Context on Single H100
Llama 7B    FP16        128K+ (no constraint)
Llama 13B   FP16        ~100K
Llama 70B   FP16        ❌ Doesn't fit
Llama 70B   INT4        ~40K (limited)

The H100 can't run Llama 70B at full precision at all, let alone with a meaningful context window. You need multiple GPUs with tensor parallelism - adding complexity, cost, and latency.
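The same arithmetic can be inverted to ask how much context fits once the weights are loaded. A sketch, with the architecture values and the overhead allowance as assumptions:

```python
def max_context_tokens(capacity_gb, weights_gb, layers, kv_heads, head_dim,
                       bytes_per_value=2, overhead_gb=5.0):
    # Bytes of KV cache per token: K and V, across all layers.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    free_bytes = (capacity_gb - weights_gb - overhead_gb) * 1e9
    return max(int(free_bytes // per_token), 0)

# Llama 70B FP16 (~140 GB weights, 8 KV heads under GQA) on a single H100:
print(max_context_tokens(80, 140, layers=80, kv_heads=8, head_dim=128))   # → 0
# The same model with 512 GB of unified memory: well past the 128K trained limit.
print(max_context_tokens(512, 140, layers=80, kv_heads=8, head_dim=128))
```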

The Mac Studio Advantage

With 512GB of unified memory, the Mac Studio M3 Ultra operates in a completely different regime:

Model        Precision   Max Context on Mac Studio 512GB
Llama 7B     FP16        128K+ (model limit)
Llama 13B    FP16        128K+ (model limit)
Llama 70B    FP16        128K+ (model limit)
Llama 405B   INT4        ~50K+

The constraint becomes the model's trained context limit, not hardware. That's a fundamentally different situation.

Why Long Context Matters

"But I don't need 100K tokens!" - here's why you might:

Document Processing

A single legal contract, annual report, or research paper can run to tens of thousands of tokens. Analyzing it as a whole requires a context window that holds all of it.

Codebase Analysis

Understanding how modules interact means loading many source files at once; even a mid-sized repository amounts to hundreds of thousands of tokens.

Conversation History

Long-running assistants and agents accumulate history with every turn. The more aggressively you truncate, the more the model "forgets" about earlier exchanges.

The Truncation Problem

When your context exceeds available memory, you must truncate - losing information. For many applications, this isn't acceptable. Would you want your legal AI to "forget" the first half of a contract?
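A minimal sketch of what naive truncation does (integer token IDs stand in for real tokenizer output):

```python
def truncate_to_window(tokens, max_tokens):
    # Keep only the most recent tokens; everything earlier is silently dropped.
    return tokens if len(tokens) <= max_tokens else tokens[-max_tokens:]

document = list(range(200_000))           # stand-in for a 200K-token contract
kept = truncate_to_window(document, 128_000)
print(len(kept), kept[0])                 # → 128000 72000: the first 72K tokens are gone
```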

Memory Bandwidth vs. Capacity

Some argue that the H100's superior memory bandwidth (3.35TB/s vs Mac Studio's 800GB/s) makes up for lower capacity. This is wrong for two reasons:

  1. Bandwidth doesn't help if the data doesn't fit: You can't stream what you can't store. If your context exceeds GPU memory, bandwidth is irrelevant.
  2. LLM inference is memory-bound anyway: Autoregressive generation is limited by memory bandwidth rather than compute on both platforms, so the relevant gap is the ~4x bandwidth ratio, not the far larger compute gap (training is a different story).

The H100's bandwidth advantage matters for training large batches and high-throughput inference on small models. For large model inference with long contexts, capacity is king.
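A back-of-the-envelope decode-throughput estimate illustrates the point. It assumes each generated token streams the full weights plus the KV cache once, and it ignores batching, compute, and overlap, so the figures (INT4 weights ~35 GB, KV cache ~43 GB) are rough assumptions:

```python
def decode_tokens_per_sec(bandwidth_gb_s, weights_gb, kv_cache_gb):
    # Memory-bound decode: each step reads the weights and the KV cache once.
    return bandwidth_gb_s / (weights_gb + kv_cache_gb)

# Assumed figures: Llama 70B quantized to INT4 (~35 GB) plus a ~43 GB KV cache.
print(f"H100 (3350 GB/s):      ~{decode_tokens_per_sec(3350, 35, 43):.0f} tok/s")
print(f"Mac Studio (800 GB/s): ~{decode_tokens_per_sec(800, 35, 43):.0f} tok/s")
```

The H100 is faster where the workload fits, but this configuration already sits at the edge of its 80 GB; grow the context further and the estimate for the H100 stops applying entirely.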

Practical Implications

For Developers

Budget memory before you build: estimate weights plus KV cache at your target context length, and leave headroom for activations and runtime overhead.

For Product Teams

Context limits shape the product. Features like whole-document analysis or long-running assistants are only viable if the infrastructure can hold the context.

For Infrastructure Teams

Provision by memory capacity at your required context lengths, not by TFLOPS. Hardware that looks faster on a spec sheet delivers zero throughput if the workload doesn't fit.

Conclusion

Context window size is constrained by GPU memory, not model architecture. The KV cache grows linearly with context length, and for large models, it quickly exceeds available VRAM on traditional GPUs.

This is why 512GB unified memory is transformative for LLM applications: it removes memory as the binding constraint, letting you use models' full context capabilities without multi-GPU complexity.

When evaluating infrastructure for LLM deployment, don't just compare TFLOPS or bandwidth. Calculate your actual memory needs at your required context lengths. You might find that the "slower" hardware is the only one that can actually run your workload.

Need Long Context Inference?

Run Llama 70B with 128K+ token contexts on a single machine. 512GB unified memory, no multi-GPU complexity.


Nick

Founder at MetalCloud. Building the future of Apple Silicon cloud computing.