Context window size is the silent killer of LLM applications. You can have the fastest GPU in the world, but if you can't fit your context in memory, speed is irrelevant. This article explains why memory capacity - not compute speed - is often the binding constraint for real-world LLM deployments.
What Is a Context Window?
The context window is the maximum amount of text (measured in tokens) that an LLM can "see" at once. When you send a prompt to an LLM, the context window includes:
- System prompt: Instructions defining the model's behavior
- Conversation history: Previous messages in a chat
- User input: The current query or document
- Generated output: The response being created
Modern LLMs support increasingly large context windows: Claude supports 200K tokens, GPT-4 Turbo supports 128K, and Llama 3.x supports up to 128K. But supporting a context window and actually using it are different things.
The KV Cache: Where Memory Goes
Here's the technical detail that changes everything: transformer models don't just need memory for model weights. They need memory for the KV (Key-Value) cache - a data structure that grows linearly with context length.
How the KV Cache Works
During inference, each token attends to all previous tokens. To avoid recomputing attention for every token, models cache the Key and Value projections for each layer. This cache must be stored in GPU memory.
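The linear growth is easy to see in a toy sketch. The class below is a minimal stand-in for one attention layer's cache (real engines preallocate or page this memory rather than concatenating); the head count and head dimension are illustrative:

```python
import numpy as np

class LayerKVCache:
    """Toy KV cache for one attention layer, stored in FP16."""

    def __init__(self, n_kv_heads: int, head_dim: int):
        self.keys = np.empty((0, n_kv_heads, head_dim), dtype=np.float16)
        self.values = np.empty((0, n_kv_heads, head_dim), dtype=np.float16)

    def append(self, k: np.ndarray, v: np.ndarray) -> None:
        # One token's K and V are stored per decode step, so the cache
        # grows linearly with context length.
        self.keys = np.concatenate([self.keys, k[None]], axis=0)
        self.values = np.concatenate([self.values, v[None]], axis=0)

    def nbytes(self) -> int:
        return self.keys.nbytes + self.values.nbytes

cache = LayerKVCache(n_kv_heads=8, head_dim=128)
for _ in range(1000):  # decode 1,000 tokens
    k = np.zeros((8, 128), dtype=np.float16)
    v = np.zeros((8, 128), dtype=np.float16)
    cache.append(k, v)

print(cache.nbytes())  # 2 (K and V) * 1000 * 8 * 128 * 2 bytes = 4,096,000
```

Multiply that per-layer cost by the number of layers (80 for Llama 70B) and the figures in the tables below fall out.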
For Llama 70B at a 128K-token context, the KV cache alone comes to roughly 39GB (see the table below). That 39GB is in addition to the ~140GB needed for model weights, so running Llama 70B with a 128K context requires ~180-200GB of GPU memory.
The Memory Math
Total GPU memory needed = Model weights + KV cache + Activations + Overhead. For large models with long contexts, the KV cache can exceed the model weights themselves.
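The KV-cache term reduces to a one-line formula. The parameters below (80 layers, grouped-query attention with 8 KV heads, head dimension 128) are Llama 70B's published architecture; FP16 means 2 bytes per cached element:

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 n_tokens: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in GiB: 2 tensors (K and V) per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem / 2**30

# Llama 70B at a 128K context with an FP16 cache:
print(round(kv_cache_gib(80, 8, 128, 128_000), 1))  # ~39.1
```

That matches the ~39GB entry in the table below; models without grouped-query attention cache several times more per token.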
KV Cache Size by Model and Context
| Model | Weights (FP16) | KV Cache (32K) | KV Cache (128K) |
|---|---|---|---|
| Llama 7B | 14 GB | ~2 GB | ~8 GB |
| Llama 13B | 26 GB | ~4 GB | ~16 GB |
| Llama 70B | 140 GB | ~10 GB | ~39 GB |
| Llama 405B | 810 GB | ~45 GB | ~180 GB |
Why This Matters: Real-World Constraints
The H100 Ceiling
An NVIDIA H100 has 80GB of memory. Let's see what that actually allows:
| Model | Precision | Max Context on Single H100 |
|---|---|---|
| Llama 7B | FP16 | 128K+ (no constraint) |
| Llama 13B | FP16 | ~100K |
| Llama 70B | FP16 | ❌ Doesn't fit |
| Llama 70B | INT4 | ~40K (limited) |
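The ceiling can be estimated with the same arithmetic in reverse: subtract the weights and a fixed overhead allowance from total memory, then divide what remains by the per-token KV cost. The ~8 GiB overhead and the FP16 grouped-query KV layout are assumptions; real runtimes vary:

```python
def max_kv_tokens(total_gib: float, weights_gib: float,
                  kv_bytes_per_token: int, overhead_gib: float = 8.0) -> int:
    """Tokens of KV cache that fit after weights and runtime overhead."""
    free_bytes = (total_gib - weights_gib - overhead_gib) * 2**30
    return max(int(free_bytes // kv_bytes_per_token), 0)

kv_70b = 2 * 80 * 8 * 128 * 2  # FP16 bytes/token for Llama 70B's GQA layout

print(max_kv_tokens(80, 140, kv_70b))   # 0 -- FP16 weights alone exceed an H100
print(max_kv_tokens(512, 140, kv_70b))  # far past the model's 128K limit
```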
The H100 can't run Llama 70B at full precision at all, let alone with a meaningful context window. You need multiple GPUs with tensor parallelism - adding complexity, cost, and latency.
The Mac Studio Advantage
With 512GB of unified memory, the Mac Studio M3 Ultra operates in a completely different regime:
| Model | Precision | Max Context on Mac Studio 512GB |
|---|---|---|
| Llama 7B | FP16 | 128K+ (model limit) |
| Llama 13B | FP16 | 128K+ (model limit) |
| Llama 70B | FP16 | 128K+ (model limit) |
| Llama 405B | INT4 | ~50K+ |
The constraint becomes the model's trained context limit, not hardware. That's a fundamentally different situation.
Why Long Context Matters
"But I don't need 100K tokens!" - here's why you might:
Document Processing
- A typical book: 50,000-100,000 tokens
- A legal contract: 10,000-50,000 tokens
- A research paper: 8,000-15,000 tokens
- An hour of meeting transcript: 15,000-20,000 tokens
Codebase Analysis
- A medium codebase for review: 20,000-50,000 tokens
- Multiple related files for refactoring: 10,000-30,000 tokens
- Full project context for debugging: 30,000-100,000 tokens
Conversation History
- Multi-hour support conversation: 20,000-50,000 tokens
- Ongoing project assistance: 50,000-100,000+ tokens
- RAG with extensive retrieved context: 10,000-50,000 tokens
The Truncation Problem
When your context exceeds available memory, you must truncate - losing information. For many applications, this isn't acceptable. Would you want your legal AI to "forget" the first half of a contract?
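Here is what that truncation typically looks like: a sliding-window sketch that keeps only the most recent messages. The ~4-characters-per-token estimate is a rough heuristic for English text, not a real tokenizer:

```python
def estimate_tokens(text: str) -> int:
    return len(text) // 4  # crude heuristic: ~4 characters per English token

def truncate_to_window(messages: list[str], max_tokens: int) -> list[str]:
    """Keep the most recent messages that fit; everything older is dropped."""
    kept, used = [], 0
    for msg in reversed(messages):
        cost = estimate_tokens(msg)
        if used + cost > max_tokens:
            break  # older context is silently lost here
        kept.append(msg)
        used += cost
    return list(reversed(kept))

contract = ["clause " * 500] * 20   # twenty long "contract clauses"
window = truncate_to_window(contract, max_tokens=4096)
print(len(window))  # only 4 of 20 clauses survive a 4K window
```

This is exactly the legal-contract failure mode: the model never sees the dropped clauses, and nothing in the output says so.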
Memory Bandwidth vs. Capacity
Some argue that the H100's superior memory bandwidth (3.35TB/s vs Mac Studio's 800GB/s) makes up for lower capacity. This is wrong for two reasons:
- Bandwidth doesn't help if data doesn't fit: You can't stream data you can't store. If your context exceeds GPU memory, bandwidth is irrelevant.
- LLM inference is memory-bound anyway: autoregressive generation is limited by how fast weights and cache can be streamed from memory, not by raw compute, so both platforms hit a bandwidth wall rather than a compute wall during decoding (training is a different story).
The H100's bandwidth advantage matters for training large batches and high-throughput inference on small models. For large model inference with long contexts, capacity is king.
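A back-of-envelope roofline makes the point concrete. Assume (simplistically) that generating each token streams the full weights plus the KV cache from memory once; the 140GB and ~40GB figures are the Llama 70B FP16 numbers from earlier:

```python
def decode_upper_bound(bandwidth_gbs: float, weights_gb: float,
                       kv_cache_gb: float) -> float:
    """Rough ceiling on tokens/sec: bandwidth / bytes touched per token."""
    return bandwidth_gbs / (weights_gb + kv_cache_gb)

# Llama 70B FP16 with a ~40GB KV cache (128K context):
print(round(decode_upper_bound(800, 140, 40), 1))   # Mac Studio: ~4.4 tok/s
print(round(decode_upper_bound(3350, 140, 40), 1))  # H100: ~18.6 tok/s -- on paper
```

The H100's ceiling is higher on paper, but it cannot hold the ~180GB working set in the first place; the Mac Studio's lower ceiling applies to a workload it can actually run.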
Practical Implications
For Developers
- Calculate your actual memory needs: weights + KV cache + overhead
- Don't assume you can use a model's full context window
- Consider unified memory if you need large contexts at full precision
For Product Teams
- Memory constraints shape what features are possible
- Truncation degrades quality - often silently
- Infrastructure choice determines your context ceiling
For Infrastructure Teams
- Multi-GPU setups add latency from tensor parallelism communication
- Unified memory eliminates CPU-GPU transfer overhead
- Plan for worst-case context lengths, not average
Conclusion
Context window size is constrained by GPU memory, not model architecture. The KV cache grows linearly with context length, and for large models, it quickly exceeds available VRAM on traditional GPUs.
This is why 512GB unified memory is transformative for LLM applications: it removes memory as the binding constraint, letting you use models' full context capabilities without multi-GPU complexity.
When evaluating infrastructure for LLM deployment, don't just compare TFLOPS or bandwidth. Calculate your actual memory needs at your required context lengths. You might find that the "slower" hardware is the only one that can actually run your workload.
Need Long Context Inference?
Run Llama 70B with 128K+ token contexts on a single machine. 512GB unified memory, no multi-GPU complexity.
Get Early Access