Context window size is the silent killer of LLM applications. You can have the fastest GPU in the world, but if you can't fit your context in memory, speed is irrelevant. This article explains why memory capacity - not compute speed - is often the binding constraint for real-world LLM deployments.
What Is a Context Window?
The context window is the maximum amount of text (measured in tokens) that an LLM can "see" at once. When you send a prompt to an LLM, the context window includes:
- System prompt: Instructions defining the model's behavior
- Conversation history: Previous messages in a chat
- User input: The current query or document
- Generated output: The response being created
Modern LLMs support increasingly large context windows: Claude supports 200K tokens, GPT-4 Turbo supports 128K, and Llama 3.x supports up to 128K. But supporting a context window and actually using it are different things.
The KV Cache: Where Memory Goes
Here's the technical detail that changes everything: transformer models don't just need memory for model weights. They need memory for the KV (Key-Value) cache - a data structure that grows linearly with context length.
How the KV Cache Works
During inference, each token attends to all previous tokens. To avoid recomputing attention for every token, models cache the Key and Value projections for each layer. This cache must be stored in GPU memory.
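The linear growth is easy to see in a toy sketch. The class below is a minimal stand-in for one attention layer's cache (real engines preallocate or page this memory rather than concatenating); the head count and head dimension are illustrative:

```python
import numpy as np

class LayerKVCache:
    """Toy KV cache for one attention layer, stored in FP16."""

    def __init__(self, n_kv_heads: int, head_dim: int):
        self.keys = np.empty((0, n_kv_heads, head_dim), dtype=np.float16)
        self.values = np.empty((0, n_kv_heads, head_dim), dtype=np.float16)

    def append(self, k: np.ndarray, v: np.ndarray) -> None:
        # One token's K and V are stored per decode step, so the cache
        # grows linearly with context length.
        self.keys = np.concatenate([self.keys, k[None]], axis=0)
        self.values = np.concatenate([self.values, v[None]], axis=0)

    def nbytes(self) -> int:
        return self.keys.nbytes + self.values.nbytes

cache = LayerKVCache(n_kv_heads=8, head_dim=128)
for _ in range(1000):  # decode 1,000 tokens
    k = np.zeros((8, 128), dtype=np.float16)
    v = np.zeros((8, 128), dtype=np.float16)
    cache.append(k, v)

print(cache.nbytes())  # 2 (K and V) * 1000 * 8 * 128 * 2 bytes = 4,096,000
```

Multiply that per-layer cost by the number of layers (80 for Llama 70B) and the figures in the tables below fall out.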
For Llama 70B at a 128K-token context, the KV cache alone comes to roughly 39GB (see the table below). That 39GB is in addition to the ~140GB needed for model weights, so running Llama 70B with a 128K context requires ~180-200GB of GPU memory.
The Memory Math
Total GPU memory needed = Model weights + KV cache + Activations + Overhead. For large models with long contexts, the KV cache can exceed the model weights themselves.
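The KV-cache term reduces to a one-line formula. The parameters below (80 layers, grouped-query attention with 8 KV heads, head dimension 128) are Llama 70B's published architecture; FP16 means 2 bytes per cached element:

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 n_tokens: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in GiB: 2 tensors (K and V) per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem / 2**30

# Llama 70B at a 128K context with an FP16 cache:
print(round(kv_cache_gib(80, 8, 128, 128_000), 1))  # ~39.1
```

That matches the ~39GB entry in the table below; models without grouped-query attention cache several times more per token.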
KV Cache Size by Model and Context
| Model | Weights (FP16) | KV Cache (32K) | KV Cache (128K) |
|---|---|---|---|
| Llama 7B | 14 GB | ~2 GB | ~8 GB |
| Llama 13B | 26 GB | ~4 GB | ~16 GB |
| Llama 70B | 140 GB | ~10 GB | ~39 GB |
| Llama 405B | 810 GB | ~45 GB | ~180 GB |
Why This Matters: Real-World Constraints
The H100 Ceiling
An NVIDIA H100 has 80GB of memory. Let's see what that actually allows:
| Model | Precision | Max Context on Single H100 |
|---|---|---|
| Llama 7B | FP16 | 128K+ (no constraint) |
| Llama 13B | FP16 | ~100K |
| Llama 70B | FP16 | ❌ Doesn't fit |
| Llama 70B | INT4 | ~40K (limited) |
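The ceiling can be estimated with the same arithmetic in reverse: subtract the weights and a fixed overhead allowance from total memory, then divide what remains by the per-token KV cost. The ~8 GiB overhead and the FP16 grouped-query KV layout are assumptions; real runtimes vary:

```python
def max_kv_tokens(total_gib: float, weights_gib: float,
                  kv_bytes_per_token: int, overhead_gib: float = 8.0) -> int:
    """Tokens of KV cache that fit after weights and runtime overhead."""
    free_bytes = (total_gib - weights_gib - overhead_gib) * 2**30
    return max(int(free_bytes // kv_bytes_per_token), 0)

kv_70b = 2 * 80 * 8 * 128 * 2  # FP16 bytes/token for Llama 70B's GQA layout

print(max_kv_tokens(80, 140, kv_70b))   # 0 -- FP16 weights alone exceed an H100
print(max_kv_tokens(512, 140, kv_70b))  # far past the model's 128K limit
```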
The H100 can't run Llama 70B at full precision at all, let alone with a meaningful context window. You need multiple GPUs with tensor parallelism - adding complexity, cost, and latency.
The Mac Studio Advantage
With 512GB of unified memory, the Mac Studio M3 Ultra operates in a completely different regime:
| Model | Precision | Max Context on Mac Studio 512GB |
|---|---|---|
| Llama 7B | FP16 | 128K+ (model limit) |
| Llama 13B | FP16 | 128K+ (model limit) |
| Llama 70B | FP16 | 128K+ (model limit) |
| Llama 405B | INT4 | ~50K+ |
The constraint becomes the model's trained context limit, not hardware. That's a fundamentally different situation.
Why Long Context Matters
"But I don't need 100K tokens!" - here's why you might:
Document Processing
- A typical book: 50,000-100,000 tokens
- A legal contract: 10,000-50,000 tokens
- A research paper: 8,000-15,000 tokens
- An hour of meeting transcript: 15,000-20,000 tokens
Codebase Analysis
- A medium codebase for review: 20,000-50,000 tokens
- Multiple related files for refactoring: 10,000-30,000 tokens
- Full project context for debugging: 30,000-100,000 tokens
Conversation History
- Multi-hour support conversation: 20,000-50,000 tokens
- Ongoing project assistance: 50,000-100,000+ tokens
- RAG with extensive retrieved context: 10,000-50,000 tokens
The Truncation Problem
When your context exceeds available memory, you must truncate - losing information. For many applications, this isn't acceptable. Would you want your legal AI to "forget" the first half of a contract?
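Here is what that truncation typically looks like: a sliding-window sketch that keeps only the most recent messages. The ~4-characters-per-token estimate is a rough heuristic for English text, not a real tokenizer:

```python
def estimate_tokens(text: str) -> int:
    return len(text) // 4  # crude heuristic: ~4 characters per English token

def truncate_to_window(messages: list[str], max_tokens: int) -> list[str]:
    """Keep the most recent messages that fit; everything older is dropped."""
    kept, used = [], 0
    for msg in reversed(messages):
        cost = estimate_tokens(msg)
        if used + cost > max_tokens:
            break  # older context is silently lost here
        kept.append(msg)
        used += cost
    return list(reversed(kept))

contract = ["clause " * 500] * 20   # twenty long "contract clauses"
window = truncate_to_window(contract, max_tokens=4096)
print(len(window))  # only 4 of 20 clauses survive a 4K window
```

This is exactly the legal-contract failure mode: the model never sees the dropped clauses, and nothing in the output says so.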
Memory Bandwidth vs. Capacity
Some argue that the H100's superior memory bandwidth (3.35TB/s vs Mac Studio's 800GB/s) makes up for lower capacity. This is wrong for two reasons:
- Bandwidth doesn't help if data doesn't fit: You can't stream data you can't store. If your context exceeds GPU memory, bandwidth is irrelevant.
- LLM inference is memory-bound anyway: autoregressive generation is limited by how fast weights and cache can be streamed from memory, not by raw compute, so both platforms hit a bandwidth wall rather than a compute wall during decoding (training is a different story).
The H100's bandwidth advantage matters for training large batches and high-throughput inference on small models. For large model inference with long contexts, capacity is king.
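A back-of-envelope roofline makes the point concrete. Assume (simplistically) that generating each token streams the full weights plus the KV cache from memory once; the 140GB and ~40GB figures are the Llama 70B FP16 numbers from earlier:

```python
def decode_upper_bound(bandwidth_gbs: float, weights_gb: float,
                       kv_cache_gb: float) -> float:
    """Rough ceiling on tokens/sec: bandwidth / bytes touched per token."""
    return bandwidth_gbs / (weights_gb + kv_cache_gb)

# Llama 70B FP16 with a ~40GB KV cache (128K context):
print(round(decode_upper_bound(800, 140, 40), 1))   # Mac Studio: ~4.4 tok/s
print(round(decode_upper_bound(3350, 140, 40), 1))  # H100: ~18.6 tok/s -- on paper
```

The H100's ceiling is higher on paper, but it cannot hold the ~180GB working set in the first place; the Mac Studio's lower ceiling applies to a workload it can actually run.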
Practical Implications
For Developers
- Calculate your actual memory needs: weights + KV cache + overhead
- Don't assume you can use a model's full context window
- Consider unified memory if you need large contexts at full precision
For Product Teams
- Memory constraints shape what features are possible
- Truncation degrades quality - often silently
- Infrastructure choice determines your context ceiling
For Infrastructure Teams
- Multi-GPU setups add latency from tensor parallelism communication
- Unified memory eliminates CPU-GPU transfer overhead
- Plan for worst-case context lengths, not average
Conclusion
Context window size is constrained by GPU memory, not model architecture. The KV cache grows linearly with context length, and for large models, it quickly exceeds available VRAM on traditional GPUs.
This is why 512GB unified memory is transformative for LLM applications: it removes memory as the binding constraint, letting you use models' full context capabilities without multi-GPU complexity.
When evaluating infrastructure for LLM deployment, don't just compare TFLOPS or bandwidth. Calculate your actual memory needs at your required context lengths. You might find that the "slower" hardware is the only one that can actually run your workload.
Need Long Context Inference?
Run Llama 70B with 128K+ token contexts on a single machine. 512GB unified memory, no multi-GPU complexity.
Get Early Access