Context Engineering: Why Your Prompts Aren’t the Problem

The Teams Winning With AI Didn’t Write Better Prompts

Engineering teams shipping substantial code volumes are not using different models or writing superior prompts. They have constructed a superior operational environment for their agents. The gap stems not from capability differences—all teams access identical frontier models—but from what they present to those models: how context is structured, memory is managed, feedback loops close, and agents receive a coherent picture of their objectives.

What Context Engineering Actually Is

Context engineering involves strategically populating the context window with precisely calibrated information for each step. Andrej Karpathy defined it as “the delicate art and science of filling the context window with just the right information for the next step.” Shopify CEO Tobi Lütke framed it as “the art of providing all the context for the task to be plausibly solvable by the LLM.” Google’s engineering team emphasized treating “context as a first-class system with its own architecture, lifecycle, and constraints.”

Context engineering differs fundamentally from prompt engineering. Most people picture context as a container: you insert materials, the model reads them, and you receive output. This misses what actually happens. Models have no memory beyond the context window. Everything about your task, codebase, preferences, history, and constraints must be present in that window at the moment of the request. Whatever is missing, the model guesses.

Consider two engineers facing tasks of identical complexity on the same model. Engineer A writes a clean, specific prompt. Engineer B writes a mediocre prompt but maintains a CLAUDE.md file in the repository root, has loaded relevant code examples, and keeps a document explaining team conventions. Engineer B consistently receives superior output. The prompt barely mattered; the context did the work.

Why Context Failures Look Like Model Failures

Research on AI agent failures reveals that teams misattribute root causes. When output disappoints, the reflex is to blame the model, swap models, or refine the prompt. The actual cause almost always lies in the context.

Anthropic’s engineering team observed that “most agent failures are not model failures. They are context failures.” The guidance emphasizes thoughtfulness—maintaining context that remains informative yet concise. This represents a design constraint rather than a prompting problem.

Birgitta Böckeler at Thoughtworks noted that “the number of options to configure and enrich a coding agent’s context has exploded,” with reliable output coming from teams treating configuration as genuine engineering rather than afterthought.

The Four Layers of Context

Context contains distinct layers serving different functions. Teams achieving reliable large-scale output work across all four.

Layer 1: The Spec Layer

This layer remains underused in software development, yet it most directly mirrors how strong human engineering teams work. Senior engineers consult PRDs, acceptance criteria, and technical specifications before coding. Models should too.

Well-constructed PRDs provide the “why”: what problem gets solved, who benefits, and what success looks like. Acceptance criteria are more valuable still: explicit, testable statements that define completion. Technical specs translate this into implementation decisions: the services involved, the data models, and the architectural fit.

When teams omit this layer, models fill the gap with assumptions, often producing technically correct but contextually wrong output.

Example acceptance criteria:

## Feature: User CSV export

**What it does:** Allows users to export transaction history as CSV.

**Acceptance criteria:**
- Export button appears only for paid plan users
- CSV includes: date, description, amount, category, status
- Amounts as raw numbers (no currency symbols)
- Empty state: export headers-only CSV, no error
- File name format: transactions-YYYY-MM-DD.csv using user's local timezone

**Out of scope:**

A few lines of acceptance criteria clarify what completion means and what to exclude, and they prevent the scope creep common in AI-generated code.

The spec layer compounds over time. A living feature catalog documenting what was built, why, and how it interconnects becomes invaluable context. Without it, each task starts from scratch. With it, models understand the system they are operating in.
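To illustrate why testable criteria help, here is a minimal Python sketch of how the criteria above map almost one-to-one onto code and checks. The function names (`can_export`, `export_filename`, `export_csv`) and the plan value "paid" are hypothetical, chosen only for illustration; the spec does not name them.

```python
import csv
import io
from datetime import date

# Column order taken from the acceptance criteria above.
COLUMNS = ["date", "description", "amount", "category", "status"]


def can_export(plan: str) -> bool:
    """Export is offered only to paid-plan users ("paid" is an assumed plan name)."""
    return plan == "paid"


def export_filename(local_today: date) -> str:
    """Build the file name from a date already converted to the user's local timezone."""
    return f"transactions-{local_today.isoformat()}.csv"


def export_csv(transactions: list[dict]) -> str:
    """Render transactions as CSV text; an empty list yields a headers-only file."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for row in transactions:
        # Amounts must be raw numbers, so formatted strings such as "$4.50" are rejected.
        if not isinstance(row["amount"], (int, float)):
            raise TypeError("amount must be a raw number, not a formatted string")
        writer.writerow({column: row[column] for column in COLUMNS})
    return buffer.getvalue()
```

Each criterion becomes either a small function or a single assertion, which is what makes the spec checkable by a person or an agent.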

Layer 2: The Knowledge Layer

This encompasses everything injected to convey your specific situation to the model. In coding contexts, this is typically a CLAUDE.md file in the repository root, loaded automatically. It contains architectural decisions, naming conventions, preferred and prohibited libraries, and how to handle common patterns.

This layer should read like documentation for a brilliant new hire, not a list of commands. You provide background that enables good independent decisions rather than behavioral instructions.

Compare approaches:

Weak:
You are a helpful coding assistant. Write clean code.

Strong:
You are a senior backend engineer in a TypeScript monorepo.
The codebase uses Zod for validation, Prisma for database access,
and enforces strict service/route handler separation.
Never use any as a type. Always handle errors explicitly.

Same model, completely different output. The difference lies not in the instruction but in the context.

Layer 3: Conversation History

Models read everything that precedes the current turn in a conversation, and it shapes every subsequent response. A history of vague requests calibrates the model to that level. Starting a fresh conversation for each distinct task keeps the context clean. Extended single conversations degrade quality as cluttered, contradictory history accumulates. Long conversations also consume context window space, which can cause earlier content to be dropped without any indication.
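One way to make "start fresh" a habit rather than a judgment call is a rough budget check. The sketch below is an assumption-heavy illustration: the four-characters-per-token estimate is a crude heuristic, not a real tokenizer, and the `Message` type, `should_start_fresh` name, and 0.5 threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str
    text: str


def estimate_tokens(text: str) -> int:
    """Very rough estimate: about four characters per token (an assumption)."""
    return max(1, len(text) // 4)


def should_start_fresh(
    history: list[Message], window_tokens: int, threshold: float = 0.5
) -> bool:
    """Return True once the history uses at least `threshold` of the window."""
    used = sum(estimate_tokens(message.text) for message in history)
    return used >= window_tokens * threshold
```

In practice you would swap in your model provider's token counter and tune the threshold against observed quality drops.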

Layer 4: The Retrieved Layer

This encompasses anything pulled dynamically for a specific task: search results, code files, documentation snippets, tool outputs. Teams frequently err by retrieving too much. More context does not equal better context. Irrelevant context actively harms output by diluting the signal and giving the model more to confuse.

Retrieval precision matters critically. A directly relevant 200-line function outperforms a 2,000-line file that is 90% noise. Teams dumping entire documentation sites into context while wondering why models confuse unrelated features have created retrieval problems, not model problems.

Consider a payment processing bug: retrieving “everything payment-related” means 14 files, 3,800 lines covering webhooks, invoicing, refunds, and subscriptions. The model must reason across all of it to find the one relevant function. The same task with precision retrieval: the specific handler, two utility functions it calls, and its expected error type. Ninety lines. The model proceeds directly to the problem.

The Six Failure Modes That Repeat Across Teams

Skipping the spec layer entirely. Handing models vague task descriptions and expecting them to infer intent is the biggest source of poor output. PRDs need not be lengthy. Even one-page documents with clear acceptance criteria dramatically improve output quality.

Writing the knowledge layer instruction-style. The knowledge layer should read as documentation: background, context, conventions. Models process this differently than command lists, and the difference shows in the output.

Not managing conversation length. Running everything in one extended conversation commonly causes quality to drop mid-session. Fresh conversations for distinct tasks are quality control, not merely a workflow preference.

Retrieval without curation. Pulling every potentially relevant file is not the same as pulling the right files.

Ignoring implicit structure. Context order and format matter. Google’s ADK team documented that context flooded with irrelevant data causes models to fixate on past patterns rather than immediate instructions. Placing the most important constraints at the end of the input consistently improves output.

Conflating model capability with context quality. When output disappoints, most blame the model. Usually, context explains the problem. Before concluding a model cannot accomplish something, rebuild context from scratch and retry.
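The ordering point under "Ignoring implicit structure" can be made mechanical. The following is a minimal sketch, assuming a simple four-part layout; the function name `assemble_context` and the section headings are hypothetical. It always places the constraints last, closest to where the model begins generating.

```python
def assemble_context(
    knowledge: str, retrieved: list[str], task: str, constraints: list[str]
) -> str:
    """Join the context parts in a fixed order, with constraints at the very end."""
    sections = [
        "## Background\n" + knowledge.strip(),
        "## Retrieved material\n" + "\n\n".join(item.strip() for item in retrieved),
        "## Task\n" + task.strip(),
        # Constraints go last so they are the final thing the model reads.
        "## Constraints\n" + "\n".join(f"- {rule}" for rule in constraints),
    ]
    return "\n\n".join(sections)
```

Fixing the order in code also makes it easy to rebuild context from scratch when retrying a task.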

How to Build This in Four Weeks

Week one: observe, don’t optimize. Document every failure: what was requested, what went wrong, and what knowledge the model lacked. This list seeds your knowledge layer.

Week two: build the knowledge layer. Transform week-one patterns into a CLAUDE.md addressing them directly. Cover domain knowledge the model repeatedly misses: naming conventions, architectural constraints, library preferences, off-limits areas. Write it as documentation, not as instructions.

Week three: get serious about specs. Before meaningful tasks, write at minimum a brief PRD and acceptance criteria. Add technical specs for architecturally complex work. This need not be formal—markdown files with several sections suffice. Writing forces clarity that directly translates to superior output.

Week four: test retrieval. Be deliberate about what external content you retrieve and how much. Run identical tasks with different retrieval strategies, comparing output quality. Less usually proves more.

Afterward, iterate. The knowledge layer is a living document. Update it whenever you catch a new failure pattern.

This month-long process establishes foundations. Subsequently, you stop fighting the model and begin collaborating with it.
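The week-one failure log does not need tooling, but a small structure helps when turning it into the week-two knowledge layer. This is a hypothetical sketch: the `FailureRecord` fields mirror the three things the log should capture, and `knowledge_gaps` simply counts which missing knowledge recurs most often.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class FailureRecord:
    requested: str          # what was asked of the model
    went_wrong: str         # what the output got wrong
    missing_knowledge: str  # what the model would have needed to know


def knowledge_gaps(records: list[FailureRecord]) -> list[tuple[str, int]]:
    """Rank missing-knowledge topics by how often they caused a failure."""
    counts = Counter(record.missing_knowledge.strip().lower() for record in records)
    return counts.most_common()
```

The topics at the top of the ranking are the first candidates for a section in CLAUDE.md.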

What This Has to Do With Specs

The spec layer is not peripheral to context engineering; it is the most essential layer.

Spec-driven development’s core insight: agents perform only as well as their instructions, and the most important instructions are not those typed into chat boxes. They are pre-session documents—the PRD, acceptance criteria, technical constraints, architectural decisions.

When these documents live in repositories, they are machine-readable. Agents can access them. When they exist only in Slack threads or Google Docs, they effectively do not exist from the agent’s perspective.

Teams shipping reliably at scale treat specs not as human documentation but as the primary mechanism through which human intent becomes legible to agents. This reframing affects how carefully these are written, how precisely completion is defined, and how consistently they are maintained as codebases evolve.

The prompt is not the variable. The context is.

FAQ

What is context engineering? Context engineering is the deliberate design of everything flowing into a model’s context window: not merely the prompt, but task descriptions, conversation history, retrieved documents, tool outputs, memory artifacts, specs, and the structure connecting them.

How is context engineering different from prompt engineering? Prompt engineering focuses on crafting individual well-worded instructions. Context engineering addresses the entire information environment in which models operate. Prompts exist within context; context engineering determines what context contains, how it is structured, and how it evolves across sessions.

Why does context quality matter more than model quality? All teams access identical frontier models. The differentiator is not capability but environment. A model given poor context produces poor output regardless of its capability. The same model given rich, well-structured context produces substantially better output.

What is the most underused layer of context? The spec layer. Most teams skip it, handing models vague task descriptions. Even short PRDs with clear acceptance criteria dramatically improve output quality by eliminating gaps models would fill with their own assumptions.

What is context pollution? Context pollution is an excess of irrelevant, redundant, or conflicting information in the context window. It distracts models and degrades reasoning accuracy. Teams often mistake more context for better context. Retrieval precision, meaning pulling the most relevant information rather than complete dumps, is one of the highest-leverage improvements available.

How do specs relate to context engineering? Specs are the primary mechanism through which human intent becomes legible to agents. In well-designed context systems, specs function not as human documentation but as machine-readable repository files that agents read to understand what to build, what constitutes completion, and what constraints to respect. Poor specs commonly underlie poor agent output.