Devplan

Context Engineering: Why Your Prompts Aren’t the Problem

2026-03-23T06:00:00-07:00

The Teams Winning With AI Didn’t Write Better Prompts

Engineering teams shipping large volumes of code are not using different models or writing better prompts. They have built a better operating environment for their agents. The gap does not come from capability, since every team has access to the same frontier models. It comes from what they present to those models: how context is structured, how memory is managed, how feedback loops close, and whether agents receive a coherent picture of their objectives.

What Context Engineering Actually Is

Context engineering is the practice of deliberately filling the context window with the information a model needs at each step. Andrej Karpathy defined it as “the delicate art and science of filling the context window with just the right information for the next step.” Shopify CEO Tobi Lütke framed it as “the art of providing all the context for the task to be plausibly solvable by the LLM.” Google’s engineering team emphasized treating “context as a first-class system with its own architecture, lifecycle, and constraints.”

Context engineering differs fundamentally from prompt engineering. Most people think of context as a container: you put material in, the model reads it, and you get output. That misses what actually happens. Models have no memory beyond the context window. Everything about your task, codebase, preferences, history, and constraints must be in that window at the moment the model runs. Whatever is missing, the model guesses.

Consider two engineers given tasks of identical complexity on the same model. Engineer A writes a clean, specific prompt. Engineer B writes a mediocre prompt but maintains a CLAUDE.md file in the repository root, has loaded relevant code examples, and keeps a document explaining team conventions. Engineer B consistently gets better output. The prompt barely mattered; the context did the work.

Why Context Failures Look Like Model Failures

Research on AI agent failures suggests that teams misattribute root causes. When output disappoints, the reflex is to blame the model, swap models, or refine the prompt. The actual cause almost always lies in the context.

Anthropic’s engineering team observed that “most agent failures are not model failures. They are context failures.” Their guidance is to keep context informative yet concise. That is a design constraint, not a prompting problem.

Birgitta Böckeler at Thoughtworks noted that “the number of options to configure and enrich a coding agent’s context has exploded,” with reliable output coming from teams that treat configuration as genuine engineering rather than an afterthought.

The Four Layers of Context

Context contains distinct layers serving different functions. Teams achieving reliable large-scale output work across all four.

Layer 1: The Spec Layer

This layer is underused in software development, yet it most directly mirrors how strong human engineering teams work. Senior engineers consult PRDs, acceptance criteria, and technical specifications before coding. Models should too.

A well-constructed PRD provides the “why”: what problem is being solved, who benefits, and what success looks like. Acceptance criteria are more valuable still: explicit, testable statements that define completion. Technical specs translate both into implementation decisions, such as which services are involved, what the data models are, and how the work fits the architecture.

When teams omit this layer, models fill the gap with assumptions, often producing technically correct but contextually wrong output.

Example acceptance criteria:

## Feature: User CSV export

**What it does:** Allows users to export transaction history as CSV.

**Acceptance criteria:**
- Export button appears only for paid plan users
- CSV includes: date, description, amount, category, status
- Amounts as raw numbers (no currency symbols)
- Empty state: export headers-only CSV, no error
- File name format: transactions-YYYY-MM-DD.csv using user's local timezone

**Out of scope:**

A few lines of acceptance criteria clarify what completion means and what to exclude, and they head off the scope creep that commonly appears in AI-generated code.
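Criteria written this way can be turned almost line for line into executable checks. The following is a minimal Python sketch, not an implementation from any real codebase: the function names, the plan value "paid", and the row shape are assumptions made for illustration.

```python
import csv
import io
from datetime import date

# Column order taken from the acceptance criteria above.
CSV_COLUMNS = ["date", "description", "amount", "category", "status"]


def can_export(plan: str) -> bool:
    """Export is offered only to paid-plan users ("paid" is an assumed plan name)."""
    return plan == "paid"


def export_filename(local_today: date) -> str:
    """Build transactions-YYYY-MM-DD.csv; the caller supplies the user's local date."""
    return f"transactions-{local_today.isoformat()}.csv"


def export_csv(transactions: list[dict]) -> str:
    """Render rows as CSV; an empty list produces a headers-only file, not an error."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(CSV_COLUMNS)
    for row in transactions:
        # float() rejects strings such as "$4.50", so amounts stay raw numbers.
        amount = float(row["amount"])
        writer.writerow(
            [row["date"], row["description"], amount, row["category"], row["status"]]
        )
    return buffer.getvalue()
```

Each function maps to one criterion, which makes review a matter of checking the list rather than reconstructing intent.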

The spec layer compounds over time. A living feature catalog documenting what was built, why, and how it interconnects becomes invaluable context. Without it, each task starts from scratch. With it, the model understands the system it is working in.

Layer 2: The Knowledge Layer

This layer covers everything you inject to convey your specific situation to the model. In coding contexts it is typically a CLAUDE.md file in the repository root, loaded automatically. It records architectural decisions, naming conventions, preferred and prohibited libraries, and how common patterns are handled.

This layer should read like documentation for a brilliant new hire, not a list of commands. You are providing the background that enables good independent decisions, not dictating behavior.

Compare approaches:

Weak:
You are a helpful coding assistant. Write clean code.

Strong:
You are a senior backend engineer in a TypeScript monorepo.
The codebase uses Zod for validation, Prisma for database access,
and enforces strict service/route handler separation.
Never use any as a type. Always handle errors explicitly.

Same model, completely different output. The difference lies not in the instruction but in the context.

Layer 3: Conversation History

Models read everything that precedes a message in a conversation, and it shapes every subsequent response. A history of vague requests calibrates the model to that level. Starting a fresh conversation for each distinct task keeps context clean. A single extended conversation degrades quality as cluttered, contradictory history accumulates. Long conversations also consume context window space, which can cause earlier content to be dropped without any indication.

Layer 4: The Retrieved Layer

This layer covers anything pulled in dynamically for a specific task: search results, code files, documentation snippets, tool outputs. The most common mistake is retrieving too much. More context is not better context. Irrelevant context actively harms output by diluting the signal and giving the model more to be confused by.

Retrieval precision matters critically. A directly relevant 200-line function outperforms a 2,000-line file that is 90% noise. Teams dumping entire documentation sites into context while wondering why models confuse unrelated features have created retrieval problems, not model problems.

Consider a payment processing bug. Retrieving “everything payment-related” means 14 files and 3,800 lines covering webhooks, invoicing, refunds, and subscriptions. The model has to reason across all of it to find the one relevant function. The same task with precision retrieval pulls the specific handler, the two utility functions it calls, and its expected error type. Ninety lines. The model goes straight to the problem.
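One way to make that precision mechanical is to rank candidate snippets and admit them only while they fit a line budget. The sketch below assumes a relevance score already exists for each snippet (how it is computed is out of scope here); `Snippet`, `select_context`, and the threshold values are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Snippet:
    path: str
    text: str
    relevance: float  # hypothetical score in [0, 1] from whatever ranker you use

    @property
    def line_count(self) -> int:
        return len(self.text.splitlines())


def select_context(
    snippets: list[Snippet], max_lines: int, min_relevance: float = 0.5
) -> list[Snippet]:
    """Keep the most relevant snippets that fit within a total line budget."""
    chosen: list[Snippet] = []
    used = 0
    for snippet in sorted(snippets, key=lambda s: s.relevance, reverse=True):
        if snippet.relevance < min_relevance:
            break  # sorted descending, so everything after this is also too weak
        if used + snippet.line_count > max_lines:
            continue  # skip pieces that would blow the budget
        chosen.append(snippet)
        used += snippet.line_count
    return chosen
```

With a 100-line budget, a 60-line handler and a 30-line utility file are kept, while a 900-line invoicing module is skipped even though it scores reasonably well.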

The Six Failure Modes That Repeat Across Teams

Skipping the spec layer entirely. Handing a model a vague task description and expecting it to infer intent is the biggest single source of poor output. PRDs need not be lengthy. Even a one-page document with clear acceptance criteria dramatically improves output quality.

Writing the knowledge layer instruction-style. The knowledge layer should read as documentation: background, context, conventions. Models process this differently from a list of commands, and the difference shows in the output.

Not managing conversation length. Quality commonly drops partway through one extended conversation. Starting a fresh conversation for each distinct task is quality control, not mere workflow preference.

Retrieval without curation. Pulling every potentially relevant file differs from pulling the right files.

Ignoring implicit structure. The order and format of context matter. Google’s ADK team documented that context flooded with irrelevant data causes models to fixate on past patterns rather than immediate instructions. Placing the most important constraints at the end of the input consistently improves output.

Conflating model capability with context quality. When output disappoints, most blame the model. Usually, context explains the problem. Before concluding a model cannot accomplish something, rebuild context from scratch and retry.
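The point about order and format can be made concrete with a small assembly function that always emits the layers in the same sequence and puts hard constraints last. This is a sketch under assumptions: the section titles, the `assemble_context` name, and the Markdown-style headings are illustrative choices, not a format any particular tool requires.

```python
def assemble_context(
    knowledge: str,
    spec: str,
    retrieved: list[str],
    task: str,
    constraints: list[str],
) -> str:
    """Join context layers in a fixed order, ending with the task and its constraints."""
    sections = [
        ("Project knowledge", knowledge),
        ("Spec", spec),
        ("Retrieved code", "\n\n".join(retrieved)),
        ("Task", task),
        # Constraints go last so they sit closest to where the model starts answering.
        ("Constraints", "\n".join(f"- {item}" for item in constraints)),
    ]
    # Drop empty sections so the model never sees a heading with nothing under it.
    return "\n\n".join(
        f"## {title}\n{body}" for title, body in sections if body.strip()
    )
```

Because the order is fixed in code, a reviewer can rebuild the context from scratch and retry a failed task without wondering which layer was missing.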

How to Build This in Four Weeks

Week one: observe, don’t optimize. Document every failure: what was requested, what went wrong, and what knowledge the model lacked. This list seeds your knowledge layer.

Week two: build the knowledge layer. Transform week-one patterns into a CLAUDE.md addressing them directly. Cover domain knowledge the model repeatedly misses: naming conventions, architectural constraints, library preferences, off-limits areas. Write documentation-style, not instructions.

Week three: get serious about specs. Before meaningful tasks, write at minimum a brief PRD and acceptance criteria. Add technical specs for architecturally complex work. This need not be formal—markdown files with several sections suffice. Writing forces clarity that directly translates to superior output.

Week four: test retrieval. Be deliberate about what external content you retrieve and how much of it. Run identical tasks with different retrieval strategies and compare output quality. Less usually proves to be more.

Afterward, iterate. The knowledge layer is a living document. Update it whenever you catch a new failure pattern.

This month-long process lays the foundation. After it, you stop fighting the model and start collaborating with it.

What This Has to Do With Specs

The spec layer is not peripheral to context engineering; it is the most essential layer.

Spec-driven development’s core insight: agents perform only as well as their instructions, and the most important instructions are not those typed into chat boxes. They are pre-session documents—the PRD, acceptance criteria, technical constraints, architectural decisions.

When these documents live in the repository, they are machine-readable and agents can access them. When they exist in Slack threads or Google Docs, they effectively do not exist from the agent’s perspective.

Teams shipping reliably at scale treat specs not as human documentation but as the primary mechanism through which human intent becomes legible to agents. This reframing affects how carefully these are written, how precisely completion is defined, and how consistently they are maintained as codebases evolve.

The prompt is not the variable. The context is.

FAQ

What is context engineering? Context engineering is the deliberate design of everything that flows into a model’s context window: not merely the prompt, but task descriptions, conversation history, retrieved documents, tool outputs, memory artifacts, specs, and the structure that connects them.

How is context engineering different from prompt engineering? Prompt engineering focuses on crafting individual well-worded instructions. Context engineering addresses the entire information environment in which models operate. Prompts exist within context; context engineering determines what context contains, how it is structured, and how it evolves across sessions.

Why does context quality matter more than model quality? All teams have access to the same frontier models. The differentiator is not capability but environment. A model given poor context produces poor output regardless of its capability. The same model given rich, well-structured context produces substantially better output.

What is the most underused layer of context? The spec layer. Most teams skip it, handing models vague task descriptions. Even short PRDs with clear acceptance criteria dramatically improve output quality by eliminating gaps models would fill with their own assumptions.

What is context pollution? Context pollution is an excess of irrelevant, redundant, or conflicting information in the context window. It distracts models and degrades reasoning accuracy. Teams often mistake more context for better context. Retrieval precision, meaning pulling the most relevant information rather than complete dumps, is one of the highest-leverage improvements available.

How do specs relate to context engineering? Specs are the primary mechanism through which human intent becomes legible to agents. In well-designed context systems, specs function not as human documentation but as machine-readable repository files that agents read to understand what to build, what constitutes completion, and what constraints to respect. Poor specs commonly underlie poor agent output.

The Harness Is Everything: Why Your AI Coding Agent Keeps Failing

2026-03-17T06:00:00-07:00

What Is an AI Agent Harness?

An AI agent harness is the complete designed environment in which a language model operates. It includes the tools the agent can call, how information is formatted and delivered into context, how history is compressed and managed across sessions, the guardrails that catch mistakes before they cascade, and the scaffolding that lets an agent hand off coherent work to its future self.

A harness is not a system prompt. It is not a wrapper around an API call. It is not a longer prompt or a better model. It is the infrastructure layer that determines what any model can actually accomplish, regardless of which model you use.

Ninad Pathak at Firecrawl published one of the most thorough breakdowns of the concept recently, covering the core components in detail: the tool layer, memory architecture, context compression, and verification loops.

The distinction between harness and model matters because most teams spend their time optimizing the wrong thing. They iterate on prompts, swap models, adjust temperature. The teams shipping reliably at scale are investing in environment design.

Why AI Coding Agents Fail (And It Is Not the Model)

The pattern is consistent across every serious team that has documented this publicly. When an AI coding agent produces bad output, the root cause almost always traces back to one of four environment failures.

Context flooding. The agent receives too much information at once. Irrelevant data competes for attention with relevant data, and output quality degrades across every subsequent step.

Missing feedback loops. The agent writes code but cannot observe whether it actually works from a user’s perspective. It optimizes for proxy metrics that do not reflect real correctness.

No persistent state. The agent has no reliable way to know what was done in a previous session, what counts as done, or what the current project state actually is.

Stale or informal context. Requirements, architectural decisions, and constraints live in Slack threads, Google Docs, or people’s heads. From the agent’s perspective, they do not exist.

Each of these is a harness problem, not a model problem. Kyle at HumanLayer makes this case with unusual directness in his piece on harness engineering for coding agents. His argument, drawn from a year of watching coding agents fail in production: bad agent output is almost never a model problem. It is a configuration problem.

The Research That Proved It: The 64% Gap

The clearest empirical proof of this came from research that the harness engineering community keeps coming back to. Researchers tested the same model on identical coding tasks—real GitHub issues from popular open source repositories—using two different environments.

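With a standard bash shell interface, the system resolved 3.97% of issues. With a purpose-built agent harness, the same model resolved 12.47%. This is the gap summarized here as a 64% relative improvement, although the two rates as quoted differ by roughly a factor of three (12.47 / 3.97 ≈ 3.1). Either way, the gain came from environment design alone. Same model. Same task. Same compute.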

The harness achieved this through four specific decisions.

Capped search results. Standard search commands can return thousands of lines. When agents get flooded, they thrash, issuing more searches, accumulating noise, and filling context with irrelevant data. The harness capped results at 50 and forced refinement when exceeded.

A stateful file viewer with line numbers. The viewer maintained position across interactions and prepended explicit line numbers to every visible line. When an agent needs to edit specific lines, it should read those numbers directly rather than count them.

An editor with integrated linting. Every edit triggered an automatic linter. Syntax errors were caught and rejected before being applied, with a clear error message. Without this, agents introduce a syntax error, run tests, see a seemingly unrelated failure, and spend ten steps chasing a ghost.

Context compression. Older observations were collapsed into single-line summaries. The agent could always see recent, relevant state without being buried in the full uncompressed history of every command it had ever run.
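Three of these decisions are small enough to sketch directly. The Python below is an illustration under assumptions, not the researchers' code: the function names are hypothetical, search is plain substring matching, the viewer is stateless (a real harness would remember the current position), and "compression" here simply keeps the first line of older observations.

```python
MAX_RESULTS = 50  # the cap described in the text


def capped_search(lines: list[str], query: str, max_results: int = MAX_RESULTS) -> str:
    """Return matching lines, or ask for a narrower query when there are too many."""
    hits = [line for line in lines if query in line]
    if len(hits) > max_results:
        return f"{len(hits)} matches for {query!r}; refine the query (limit is {max_results})."
    return "\n".join(hits)


def view_window(lines: list[str], start: int, height: int) -> str:
    """Show a window of a file with explicit 1-based line numbers on every line."""
    window = lines[start - 1 : start - 1 + height]
    return "\n".join(f"{start + offset}: {text}" for offset, text in enumerate(window))


def compress_history(observations: list[str], keep_recent: int) -> list[str]:
    """Collapse all but the most recent observations to a single line each."""
    cutoff = max(len(observations) - keep_recent, 0)
    compressed = []
    for index, observation in enumerate(observations):
        if index < cutoff:
            first_line = observation.splitlines()[0] if observation else ""
            compressed.append(f"[collapsed] {first_line}")
        else:
            compressed.append(observation)
    return compressed
```

The common thread is that each tool bounds what reaches the context window: a refusal instead of thousands of hits, numbered lines instead of counting, summaries instead of full history.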

This is the clearest proof in the literature that the bottleneck in AI agent performance is almost never the model. It is the environment.

How Anthropic Solved the Long-Running Agent Problem

Anthropic’s engineering team, building Claude Code, encountered a harder version of the same problem: tasks too large to complete inside a single context window.

Most real software projects do not fit in any context window. A production web application has hundreds of files, thousands of functions, a test suite, configuration, and dependencies. Human engineers navigate this through external memory, documentation, and accumulated context built over time. An agent starting a fresh session has none of that.

Internal experiments revealed two failure patterns consistent enough to become the design spec for their harness architecture.

Attempting to do too much at once. Given a prompt like “build a clone of claude.ai,” the agent would try to one-shot the entire application, implementing features without completing or testing any of them, running out of context mid-implementation, and leaving the next session to start with a half-built app and no documentation of what state it was in.

Declaring victory too early. After some features had been built, a subsequent agent would look around, see progress, and conclude the job was done. Not because it was unintelligent, but because it had no structured way to know what done actually meant for this project.

The Initializer and Coding Agent Architecture

The solution was a two-part harness. An initializer agent runs once and creates three things.

An init.sh script that reliably starts the development environment. Every subsequent session begins by running this script. The tokens saved on environment setup across dozens of sessions accumulate significantly.

A structured feature list: over 200 specific end-to-end feature descriptions, each initially marked as failing. This file is the project’s ground truth. An agent starting a new session reads it and knows exactly what has and has not been built. It cannot look at working code and conclude the job is done. The feature list tells it the truth. The list is deliberately stored as JSON rather than Markdown, because models are less likely to casually overwrite JSON files. The rigid structure resists the kind of editing you do not want.

A claude-progress.txt file updated at the end of every session. Combined with git history, it gives every future agent a fast orientation without burning context on archaeology.

The coding agent that runs in every subsequent session has a tighter mandate: work on one feature at a time, leave the environment clean, and update the progress file and git history before the session ends.
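The file-based handoff between sessions can be sketched in a few functions. This is a minimal illustration, not Anthropic's implementation: the JSON field names (`id`, `description`, `passes`), the function names, and the injected `verify` callback are all assumptions made for the example.

```python
import json
from pathlib import Path
from typing import Callable, Optional


def init_feature_list(path: Path, descriptions: list[str]) -> None:
    """Initializer step: record every feature as failing."""
    features = [
        {"id": number, "description": text, "passes": False}
        for number, text in enumerate(descriptions, start=1)
    ]
    path.write_text(json.dumps(features, indent=2))


def next_failing_feature(path: Path) -> Optional[dict]:
    """Coding-agent step: pick the first feature still marked failing."""
    for feature in json.loads(path.read_text()):
        if not feature["passes"]:
            return feature
    return None  # nothing left to build


def mark_verified(path: Path, feature_id: int, verify: Callable[[dict], bool]) -> bool:
    """Flip a feature to passing only if the supplied end-to-end check succeeds."""
    features = json.loads(path.read_text())
    for feature in features:
        if feature["id"] == feature_id and verify(feature):
            feature["passes"] = True
            path.write_text(json.dumps(features, indent=2))
            return True
    return False


def append_progress(path: Path, note: str) -> None:
    """End-of-session step: leave a short note for the next session."""
    with path.open("a") as handle:
        handle.write(note.rstrip() + "\n")
```

Routing every status change through `mark_verified` is the design choice that matters: a session cannot mark work done just because the code looks finished.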

The Feedback Loop Failure Nobody Talks About

Anthropic also documented a failure mode that shows up in virtually every agentic coding project: agents marking features complete without verifying them end-to-end.

An agent writes code, runs a unit test, sees it pass, marks the feature done. But the feature does not work when a real user interacts with it through a browser. The gap between unit test success and real-world functionality is something human engineers navigate by actually running the application. An agent without browser automation cannot make that shift.

The fix was giving agents access to browser automation tools so the agent could navigate the application, click buttons, fill forms, and verify real user flows. The improvement was substantial.

The principle generalizes: the quality of an agent’s work is bounded by the quality of its feedback loops. If the agent cannot observe the consequences of its actions in the domain that matters, it will optimize for proxies that do not correlate with correctness.

How OpenAI Shipped a Million Lines With No Manual Code

OpenAI’s Codex team started a repository with one constraint: no human-written code. Everything including application logic, tests, CI configuration, documentation, and observability tooling would be written by agents. Humans would steer. Agents would execute.

The result: approximately one million lines of code, roughly 1,500 merged pull requests, and three engineers averaging 3.5 PRs per engineer per day. As the team grew, per-engineer throughput increased. A real internal product with hundreds of daily users.

The central observation from their writeup: the engineering job changed entirely. When you are not writing code, you are designing environments, specifying intent, and building feedback loops. When something failed, the fix was almost never “try harder.” It was almost always “what structural piece of the environment is missing that is causing this failure?”

The Repository as System of Record

One of the most consequential decisions was making the repository the source of truth for everything an agent needed to know. Anything in a Slack thread or a Google Doc is invisible to the agent. If the agent cannot access it in context, it effectively does not exist.

Early on, the team tried the one big AGENTS.md approach, a single large instruction file containing everything. It failed in four consistent ways. A giant instruction file crowds out the actual task and relevant code. When everything is marked important, nothing is. A monolithic manual rots instantly as the codebase evolves. And a single blob is nearly impossible to verify for freshness or coverage.

The solution was a structured docs/ directory as the system of record, with a short AGENTS.md of roughly 100 lines serving as a map to deeper truth elsewhere. Progressive disclosure: agents start with a small, stable entry point and are pointed toward more when they need it, rather than overwhelmed upfront.
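A map like this stays trustworthy only if something checks it. The sketch below is one hypothetical way to do that in CI: it assumes AGENTS.md uses Markdown links into `docs/`, and the function name and the 100-line limit (taken from the figure above) are illustrative.

```python
import re
from pathlib import Path

MAX_MAP_LINES = 100  # the text describes a map of roughly 100 lines

# Matches Markdown link targets that point into docs/, e.g. [Testing](docs/testing.md).
LINK_PATTERN = re.compile(r"\]\((docs/[^)#\s]+)")


def check_agents_map(repo_root: Path, max_lines: int = MAX_MAP_LINES) -> list[str]:
    """Return problems with AGENTS.md: too long, or pointing at missing docs."""
    text = (repo_root / "AGENTS.md").read_text()
    problems = []
    line_count = len(text.splitlines())
    if line_count > max_lines:
        problems.append(f"AGENTS.md has {line_count} lines; the limit is {max_lines}.")
    for target in LINK_PATTERN.findall(text):
        if not (repo_root / target).exists():
            problems.append(f"AGENTS.md links to missing file: {target}")
    return problems
```

An empty list means the entry point is still short and every pointer still resolves, which addresses the freshness and coverage problems of the single-blob approach.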

Mechanical Architecture Enforcement

When agents are opening 3.5 PRs per engineer per day, human code review cannot be the primary quality mechanism. The solution was encoding architectural constraints as mechanical checks that run at the point of violation rather than days later in a PR comment.

Custom linters enforced dependency directions, boundary crossing, and interface consistency. The key principle: enforce invariants, not implementations. Care deeply about structural rules. Do not dictate how a specific function is built, as long as it satisfies its behavioral contract. Every linter error message was formatted specifically for injection into agent context, including the rule violated, the violation found, and the remediation steps, all in one actionable message.
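A dependency-direction check with an agent-readable message might look like the following. It is a sketch, not the Codex team's linter: the three-layer rule table, the function name, and the message format are assumptions, and it uses only Python's standard `ast` module.

```python
import ast

# Hypothetical layering: each layer may import only from the layers listed for it.
ALLOWED_IMPORTS = {
    "routes": {"services"},
    "services": {"repositories"},
    "repositories": set(),
}


def check_dependency_direction(
    layer: str, source: str, filename: str = "<unknown>"
) -> list[str]:
    """Return one actionable message per import that crosses layers the wrong way."""
    messages = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            imported = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            imported = [node.module]
        else:
            continue
        for name in imported:
            target = name.split(".")[0]
            is_layer = target in ALLOWED_IMPORTS and target != layer
            if is_layer and target not in ALLOWED_IMPORTS[layer]:
                allowed = ", ".join(sorted(ALLOWED_IMPORTS[layer])) or "nothing"
                # Rule, violation, and remediation in a single message.
                messages.append(
                    f"RULE: dependency-direction. "
                    f"VIOLATION: {filename}:{node.lineno} '{layer}' imports '{name}'. "
                    f"FIX: '{layer}' may depend on: {allowed}. "
                    f"Route this call through an allowed layer."
                )
    return messages
```

The check constrains structure only. It says nothing about how a function inside a layer is written, which is the "invariants, not implementations" principle in practice.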

Where the Term “Harness Engineering” Came From

The term started spreading earlier this year. Charlie Guo’s piece synthesized converging practices from teams at OpenAI, Stripe, and others.

The core observation Guo made, and that the research supports, is that harness engineering is a discipline in the same way that infrastructure engineering is a discipline. It is not about any single tool or technique. It is about treating the agent’s environment as a first-class engineering concern rather than an afterthought to the model.

The Five Patterns That Repeat Across Every High-Performing Harness

Across all of these systems and teams, five design patterns appear consistently. They are not coincidences. They are engineering solutions to problems that consistently emerge when deploying agents at scale.

Progressive disclosure. Give the agent the minimum it needs to orient itself, plus pointers to find more when it needs it. A short, focused entry point that maps to deeper context outperforms a comprehensive dump every time. It is also dramatically easier to keep accurate.

Git worktree isolation. One agent, one worktree. Every serious orchestration system uses this. Git worktrees give each agent its own working directory, branch, and environment. Changes are validated in isolation before touching the main codebase.

Spec first, repository as system of record. If it is not in the repository, it does not exist from the agent’s perspective. Specifications, requirements, architectural decisions, and constraints must be encoded into machine-readable files before execution begins. Documentation is no longer just for human readers. It is the mechanism through which human intent becomes legible to agents.

Mechanical architecture enforcement. Encode architectural constraints as automated checks that run at the point of violation. Enforce invariants, not implementations, and allow significant freedom within those invariants. The linter catches the violation and its error message says how to remediate it. Human review focuses on judgment calls, not structural drift.

Integrated feedback loops. Close the gap between action and consequence as tightly as possible. Syntax errors caught at edit time. Runtime errors surfaced through observability tools the agent can query. UI bugs caught through browser automation the agent can drive. For agents, errors not caught immediately accumulate in context and degrade every subsequent reasoning step.
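Of the five patterns, worktree isolation is the most mechanical to set up. The sketch below only builds the `git worktree add` command for one agent; the branch naming scheme (`agent/<id>`), the sibling-directory layout, and the function name are assumptions for illustration.

```python
from pathlib import Path


def worktree_command(repo_root: Path, agent_id: str, base: str = "main") -> list[str]:
    """Build the git invocation that gives one agent its own branch and directory."""
    branch = f"agent/{agent_id}"  # hypothetical naming scheme
    path = repo_root.parent / f"{repo_root.name}-{agent_id}"
    # git worktree add -b <new-branch> <path> <commit-ish>
    # creates the branch and checks it out into a new directory in one step.
    return ["git", "-C", str(repo_root), "worktree", "add", "-b", branch, str(path), base]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; keeping command construction separate from execution makes the naming rules easy to test without touching a real repository.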

What This Means for How You Build

When something is not working in your agent system, the harness mindset produces a different diagnostic than the default one.

Instead of “how do I write a better prompt?” ask “what information does the agent need that it currently cannot access?”

Instead of “why is the model making this mistake?” ask “what feedback loop is missing that would catch this before it propagates?”

Instead of “why is the agent not doing what I told it?” ask “what constraint in the environment is preventing it?”

This shift changes where engineering effort goes. A prompt fix solves one specific failure mode. A harness improvement prevents a category of failure modes, permanently, across every future session.

The Minimal Harness for a Real Project

You do not need a full observability stack to benefit from this thinking. Four components cover most of it.

A persistent progress file. The agent reads it at session start and writes it at session end. This alone prevents the “declare victory too early” failure and ensures continuity across context window boundaries.

A structured task list with verifiable completion criteria. Not a vague project description. A specific, enumerated list of user-visible behaviors testable end-to-end. Status updates only after verification.

Version control as a first-class session requirement. Every session ends with a commit and an updated progress file. Clean state is not a nice-to-have.

Browser automation if you are building for the web. The difference between an agent that can only read code and one that can use the application it is building is the same as the difference between a developer who reads code and one who runs it.

The Uncomfortable Bottom Line

If execution is a commodity, and the evidence suggests it increasingly is, the long-term competitive advantage in AI-driven development is not the model. It is the harness.

The teams that have figured this out built custom development environments for their specific codebases and domains. They built harness architectures enabling months of coherent incremental progress. They demonstrated dramatically better results from the same models through environment design alone. None of those advantages came from the model. They came from the environment.

The model is what thinks. The harness is what it thinks about.

FAQ

What is an AI agent harness? An AI agent harness is the complete designed environment in which a language model operates, including its tools, context structure, memory management, feedback loops, and session scaffolding. It determines what the model can actually accomplish, independent of the model’s raw capability.

Why do AI coding agents fail on complex projects? The most common failure modes are context flooding (too much irrelevant information degrading output quality), missing feedback loops (the agent cannot observe whether its work actually functions), no persistent state across sessions, and requirements that exist outside the repository where the agent cannot access them.

What is the difference between a harness and a prompt? A prompt is the input you send to the model in a single interaction. A harness is the entire system that determines what context the model receives, what tools it can use, how errors are caught, how state persists across sessions, and what constraints are enforced automatically. Prompts live inside harnesses.

How does spec-driven development relate to harness engineering? Specs are the primary mechanism through which human intent becomes legible to agents. In a well-designed harness, specs are not just documentation for humans. They are machine-readable files in the repository that the agent reads to understand what to build, what counts as done, and what constraints to respect. Poor specs are one of the most common root causes of poor agent output.

What is the minimal harness I can build today? Start with four things: a progress file the agent reads and writes each session, a structured feature list with verifiable completion criteria, git commits as a required end-of-session step, and browser automation if you are building a web product. That covers the majority of failure modes most teams run into.

What Separates Good AI Dev Teams From Great Ones

2026-02-19T06:00:00-08:00

Steve Yegge recently published an observation that resonated across engineering teams: developers using AI coding tools most heavily experience the highest burnout, drowning in review queues and running faster just to stay in place.

Yet this outcome isn’t inevitable. While every team has access to the same tools—Cursor, Claude Code, Copilot—some teams are pulling ahead while others struggle. The gap isn’t in the AI tools themselves, but in what happens between the initial idea and the first prompt.

Where Most Teams Are Losing Time They Don’t Know They’re Losing

The typical workflow loses fidelity at each handoff. An idea gets written loosely in Notion or Jira. An engineer interprets it and prompts an AI agent. The agent fills remaining gaps with statistical guesses. Code enters review, where senior engineers must reconstruct intent and send it back. This cycle repeats two or three times.

Every handoff drops context. The original intent becomes increasingly distant from what the agent was actually told. The code that emerges looks functional but isn’t quite right, forcing experienced engineers to spend extensive time figuring out where it missed.

This review burden doesn’t stem from AI writing poor code—it comes from the gap between intent and specification.

What Great Teams Do Differently

High-performing teams recognize that leverage lives in preparation, not in the IDE. They treat the handoff from idea to agent as the most critical moment in development.

These teams ensure agents receive complete specifications before generation begins:

Real acceptance criteria, not implied expectations
Explicit constraints
Edge cases fully addressed
Context grounded in actual codebase architecture

With proper specifications, output aligns with intent on the first pass. Review becomes a straightforward criteria check rather than archaeological investigation. Senior engineers focus on high-leverage work. Shipping accelerates not through faster generation, but through closer initial alignment.

Building such specifications manually requires significant effort: pulling codebase context, thinking through edge cases, and writing precise acceptance criteria all demand discipline. Most teams skip this layer, treating it as overhead. This is the critical mistake.

From Rough Idea to Agent-Ready Spec

The solution bridges the gap between idea and execution. Starting with rough intent, the process refines it into something an agent can work from. Codebase context loads automatically, grounding the spec in actual system architecture. Supporting materials, designs, and research fold in naturally.

The result transcends documentation—it becomes a specification that understands architecture, covers edge cases, and provides agents the constraints needed for sound decisions without guessing.

The Cost of Not Having This Layer

Teams lacking a specification layer don’t immediately recognize the damage. Problems accumulate gradually:

Review cycles lengthen beyond necessity
Senior engineers increasingly spend time reconstructing AI intent rather than doing high-leverage work
Junior developers lack clear contribution points, becoming passive observers
Technical debt grows silently as agents guess at unspecified gaps
Shipping velocity plateaus or drops despite faster generation speeds

None of this announces itself as a crisis—it simply makes everything incrementally slower and harder. Meanwhile, teams with proper AI coding processes compound their advantages continuously.

Comparison: With and Without Specification Layers

Without a Spec Layer	Spec-Driven Approach
Ideas scattered across Notion, Jira, Slack	Single location from idea through execution
Codebase context exists only in developers’ minds	Codebase context pulled in automatically
Specs remain vague with agent-filled gaps	Structured specs grounded in actual architecture
First-pass output technically acceptable but contextually misaligned	First-pass output aligned with intent
Multiple review rounds with senior engineers pulled in	Single-pass review using written criteria
Cognitive load falls on senior reviewers	Cognitive load falls on spec writer at any level
High technical debt from guessed solutions	Low technical debt from pre-specified gaps
Context drops between tools	Spec and execution remain integrated
Trajectory: slower and harder	Trajectory: faster and cleaner

The Move Great Teams Are Making Now

AI coding represents software development’s future—that’s settled. The open question concerns which teams build processes enabling AI at scale versus those cleaning up debt from skipping this layer.

Leading teams don’t work harder or spend more on tools. They discovered that the leverage in generation lies not in the tools themselves but in what precedes them: codebase context, genuine acceptance criteria, constraints reflecting actual system design, and an execution environment where specification intent directly translates to generated code.

This process transforms good AI teams into exceptional ones.

Why Intent Is the New Bottleneck in AI Development

2026-01-21T06:00:00-08:00

Velocity without direction is just expensive rework

AI made execution cheap. A working feature can come together in an afternoon. But most teams are finding that speed alone doesn’t translate into shipping the right thing, and the data backs that up.

Bain’s 2025 Technology Report found that teams using AI assistants see only “10 to 15 percent productivity gains,” and the time saved rarely turns into business value. Their research also showed that writing and testing code accounts for just 25 to 35 percent of the development lifecycle. Speeding up that one slice without fixing the inputs just moves the bottleneck somewhere harder to see.
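A back-of-the-envelope calculation shows why accelerating one slice has a limited effect. The sketch below applies the standard Amdahl's-law formula. It is not Bain's methodology, and the 30 percent share and 2x coding speedup are assumed values chosen only for illustration.

```python
def overall_speedup(fraction_accelerated: float, local_speedup: float) -> float:
    """Amdahl's law: end-to-end speedup when only part of the work gets faster."""
    if not 0.0 <= fraction_accelerated <= 1.0:
        raise ValueError("fraction_accelerated must be between 0 and 1")
    if local_speedup <= 0:
        raise ValueError("local_speedup must be positive")
    remaining = 1.0 - fraction_accelerated
    return 1.0 / (remaining + fraction_accelerated / local_speedup)


# Assumed inputs: coding is 30% of the lifecycle and AI makes it twice as fast.
speedup = overall_speedup(0.30, 2.0)
time_saved = 1.0 - 1.0 / speedup
print(f"{speedup:.2f}x overall, {time_saved:.0%} less elapsed time")
```

Under those assumptions, doubling coding speed shortens the whole lifecycle by roughly 15 percent, because the other 70 percent of the work is unchanged.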

A frontend lead audited a feature that an agent had completed overnight. It worked. The buttons clicked. The data saved. But when he looked at the code, the agent had imported three different date-parsing libraries to handle a single timestamp and hard-coded the timezone to UTC-8 because the prompt didn’t specify otherwise.

The code wasn’t broken. But it was heavy, wrong in ways that wouldn’t surface until someone tried to extend it, and expensive to fix after the fact. He spent the next two days untangling dependencies that didn’t need to exist, which is roughly how long it would have taken to write the feature from scratch.

This pattern shows up constantly. One developer described giving up on a project after three months: “Every time I want to change a little thing, I kill 4 days debugging other things that go south.” The agent keeps fixing symptoms because it doesn’t know the root cause. The developer doesn’t know the root cause either, because they didn’t write the code.

The uncomfortable truth is that teams are spending less time typing and more time auditing code they didn’t author.

What the agent sees	What the agent doesn’t know
“Handle payment errors”	Payment retries are legally prohibited for this transaction type
Timestamp field in the schema	Team uses UTC everywhere, never local timezones
Multiple date libraries in package.json	Only day.js is approved, the others are legacy
Redux in older components	Team migrated to Zustand six months ago
No tests in the file	Testing is required, the previous dev just skipped it

The agent doesn’t know why you chose boring technology over clever technology, why you picked Postgres over Mongo, or why the payment flow needs to be idempotent. It ships its best guess, and its best guess is statistically reasonable but architecturally wrong for your specific system.

If this sounds like your team, the fix isn’t better prompting. It’s giving agents structured context before they start writing.

Where intent goes to die

Intent doesn’t disappear all at once. It leaks out at specific points in the workflow, and each leak compounds downstream.

Where it leaks	What happens	What the agent does
Planning	Ticket describes outcome but not constraints	Agent treats ambiguity as a design decision
Context transfer	Decisions live in Slack, Notion, and people’s heads	Agent has no access, fills in blanks
Accumulation	Undocumented patterns pile up in the codebase	Next agent treats accidental patterns as intentional

The first leak happens in planning. A PM writes a ticket that says “User sees error on failed payment.” That ticket contains an outcome but not the constraints around it. It doesn’t say which error component to use, whether the system should retry, or what the logging behavior should be. A human engineer would ask follow-up questions. An AI agent treats the ambiguity as a design decision and makes one.

A team running a checkout flow learned this the hard way. Their agent added a “Retry” button to a payment screen for a transaction type that legally cannot be retried. The prompt didn’t say “no retries,” so the agent optimized for UX and guessed wrong. The feature passed QA because the testers were checking functionality, not legal compliance. It made it to staging before someone from the payments team caught it.

The second leak happens in context transfer. Architecture decisions, past trade-offs, and team preferences live in Slack threads, Notion docs, and people’s heads. None of that reaches the agent. A paper on vibe coding documented what happens when constraints are absent: a team asked an AI to fix display issues, and it responded by rewriting state management, adding new API endpoints, and creating debugging panels. The codebase grew by hundreds of lines. The root cause, a simple API mismatch, stayed unfixed because the agent lacked the constraint that would have pointed it to the actual problem.

The third leak is cumulative. Every project that runs without structured intent makes the next one harder, because the codebase now contains decisions nobody documented and patterns nobody chose deliberately. Six months later, a new agent working on a related feature treats those accidental patterns as intentional architecture and builds on top of them.

Anthropic’s engineering team wrote about a version of this problem: if a human cannot definitively say which tool to use for a task, an AI agent will not do better. The fix isn’t better models. It’s closing the gap between the person who understands the reasoning and the system that executes the code.

Spec-driven development is the missing layer

Spec-driven development has been getting a lot of attention since mid-2025, with GitHub’s Spec Kit, JetBrains’ Junie, AWS Kiro, and Augment all building some version of it. The core idea is the same across all of them: write a structured specification before any code gets written, and use that spec as the source of truth that agents work from.

The concept isn’t new. As Martin Fowler’s team at ThoughtWorks pointed out, specs have been used in software engineering for decades, from model-driven development to behavior-driven development. What’s different now is the audience. Specs used to be written for future developers. In AI-assisted development, specs are written for machines, and machines need a different kind of clarity than humans do.

Humans need explanations. Agents need prohibitions.

A good spec for an agent includes three layers that most PRDs skip entirely:

Decision logs that include the losers. Not just “we chose Postgres” but “we chose Postgres over Mongo because we need ACID compliance for the payment ledger.” If you don’t feed that constraint to the agent next week, it will write code that assumes eventual consistency. Architecture Decision Records have been around since 2011, but the format needs to shift. The audience is no longer a human who can infer intent from context. It’s a machine that will do exactly what you don’t tell it not to do.

Hard constraints that act as guardrails. These are the things that cannot change: no new npm packages without approval, use the internal UI library for all buttons, no external API calls from client-side code, payment flows must be idempotent. These constraints stop an agent from fixing one thing and breaking three others, which is the failure mode that showed up with the date-parsing libraries.

Specificity about edge case behavior. Instead of “user sees error,” the spec says “if API returns 400, display Toast Component ID ERR_400, do not auto-retry, log to Sentry with payment_id.” Ambiguity in a spec is functionally the same as a prompt injection. It tells the agent “use your judgment,” and the agent’s judgment is a statistical average of every codebase it was trained on, not yours.
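The hard-constraints layer described above is easier to enforce when it is also machine-checkable. As a minimal sketch, assuming the approved list is maintained by hand, the function below compares the dependencies in a package.json document against an allowlist. The `APPROVED_PACKAGES` set and the function name are hypothetical.

```python
import json

# Hypothetical allowlist; a real team would maintain its own.
APPROVED_PACKAGES = {"react", "next", "dayjs", "zustand"}


def unapproved_dependencies(package_json_text: str, approved: set[str]) -> list[str]:
    """Return dependency names in package.json that are not on the approved list."""
    manifest = json.loads(package_json_text)
    declared: set[str] = set()
    for section in ("dependencies", "devDependencies"):
        declared.update(manifest.get(section, {}))
    return sorted(declared - approved)


example_manifest = json.dumps({
    "dependencies": {"react": "^18.0.0", "dayjs": "^1.11.0", "moment": "^2.30.0"},
    "devDependencies": {"date-fns": "^3.0.0"},
})
print(unapproved_dependencies(example_manifest, APPROVED_PACKAGES))
# ['date-fns', 'moment']
```

A check like this could run before review, so an agent that pulls in extra date libraries is caught by the constraint rather than by a senior engineer.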

Here’s what the difference looks like in practice:

Ticket	Spec
“Handle error states on checkout”	If API returns 400: display Toast Component ID ERR_400, do not retry, log to Sentry with `payment_id`
“Add user authentication”	Use JWT with RS256 signing, refresh token rotation, 15-min access token TTL, store refresh token in httpOnly cookie
“Improve page load speed”	Lazy-load below-fold images, split vendor bundle from app bundle, target LCP under 2.5s on mobile 4G
“Fix the date display bug”	All dates render in UTC, use day.js only, format as `YYYY-MM-DD HH:mm` in all admin views

The left column is what most agents receive. The right column is what they need.
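One way to keep a spec row like the checkout one unambiguous is to write it as data rather than prose. The sketch below encodes the 400 rule. The `ErrorRule` structure and `rule_for` helper are hypothetical, and the toast ID is taken from the earlier ERR_400 example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ErrorRule:
    """Hypothetical encoding of one edge-case row from a spec."""
    status_code: int
    toast_id: str
    auto_retry: bool
    log_fields: tuple[str, ...]


RULES = {
    400: ErrorRule(status_code=400, toast_id="ERR_400", auto_retry=False,
                   log_fields=("payment_id",)),
}


def rule_for(status_code: int) -> Optional[ErrorRule]:
    """Look up the specified behavior; None means the spec has a gap."""
    return RULES.get(status_code)


print(rule_for(400).auto_retry)  # False
print(rule_for(500))             # None
```

A `None` result marks a case the spec has not covered yet, which is a prompt to extend the spec rather than let the agent guess.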

Making the intent layer stick

Individual specs help on a per-project basis, but the real value shows up when the context compounds across projects. A spec for Feature A that documents why you rejected a particular approach becomes context that Feature B’s agent can reference three months later. The system remembers what was tried, what was rejected, and why, so each project makes the next one better.

This is where most teams hit a wall. Static docs rot. A Notion page written in January is outdated by March because nobody updates it when the architecture changes. The spec layer needs to be connected to the actual codebase and updated as decisions are made, not maintained as a separate artifact that drifts from reality.

An engineering manager at a 30-person SaaS company described the before and after. Before, her team’s agents were producing code that technically worked but kept introducing patterns the team had explicitly moved away from. An agent would use Redux in a component because the older parts of the codebase still had Redux, even though the team had migrated to Zustand six months earlier. Nobody had told the agent, and the codebase itself sent mixed signals.

After implementing structured specs with their codebase context attached, the agents started following the team’s actual conventions. Not because the model got smarter, but because the inputs got better. The specs told the agent which patterns to follow and which to ignore, and the codebase analysis gave it the information to distinguish between the two.
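A lightweight version of that signal can be approximated with a scan for imports the team has moved away from. This is only a sketch: the deprecated mapping is an assumption based on the Redux example above, and the regular expression handles simple `import ... from '...'` and bare `import '...'` statements, not every module syntax.

```python
import re

# Assumed mapping of retired packages to their replacements.
DEPRECATED = {"redux": "zustand", "react-redux": "zustand"}

IMPORT_PATTERN = re.compile(
    r"""^\s*import\s+(?:.+?\s+from\s+)?['"]([^'"]+)['"]""",
    re.MULTILINE,
)


def deprecated_imports(source: str) -> list[tuple[str, str]]:
    """Return (package, replacement) pairs for deprecated imports found in source."""
    found = []
    for match in IMPORT_PATTERN.finditer(source):
        package = match.group(1)
        if package in DEPRECATED:
            found.append((package, DEPRECATED[package]))
    return found


component = '''
import { useSelector } from "react-redux";
import dayjs from "dayjs";
'''
print(deprecated_imports(component))  # [('react-redux', 'zustand')]
```

The output could be attached to a spec as a list of files whose patterns the agent should ignore rather than imitate.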

The pattern she described is consistent with Bain’s research. The companies seeing 25 to 30 percent productivity gains, rather than the more typical 10 to 15 percent, aren’t the ones with better models. They’re the ones that redesigned the workflow around the model, feeding it structured context instead of raw tickets and hoping for the best.

What to do this week

You don’t need to overhaul your workflow to start closing the intent gap. Pick one project that’s about to kick off and try these three things:

Write the decision log before the first line of code. Document what you chose, what you rejected, and why. Include the constraints that aren’t obvious from the ticket. “We need ACID compliance” is more useful to an agent than “use Postgres.”

Define five hard constraints for your codebase. These are the rules that never change: approved libraries, required components, forbidden patterns. Put them somewhere the agent can access them, not in a Slack message from four months ago.

Rewrite one vague ticket as a spec. Take a ticket that says something like “handle error states” and expand it with specific component IDs, retry behavior, and logging requirements. Run the agent against the spec instead of the ticket and compare the output.

If the agent produces better code from the spec than from the ticket, you’ve found your bottleneck. It was never the model. It was the input.
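For the first of those steps, a decision-log entry can be as simple as a record that keeps the rejected options alongside the chosen one. The sketch below uses the Postgres example from earlier. The `Decision` class and its rendering format are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    """Hypothetical decision-log entry that records the losers too."""
    chosen: str
    rejected: list[str]
    reason: str

    def as_context(self) -> str:
        """Render the entry as a line an agent can read before it writes code."""
        rejected = ", ".join(self.rejected) if self.rejected else "nothing"
        return f"Use {self.chosen} (rejected: {rejected}) because {self.reason}."


entry = Decision(
    chosen="Postgres",
    rejected=["Mongo"],
    reason="the payment ledger needs ACID compliance",
)
print(entry.as_context())
# Use Postgres (rejected: Mongo) because the payment ledger needs ACID compliance.
```

Keeping the reason in the rendered line matters more than the format, since the reason is the constraint the agent would otherwise have to guess.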

The teams that get this right will compound the advantage

Whether your team blends the PM and engineer roles or keeps them separate doesn’t matter as much as whether intent survives the handoff. Both approaches work when there’s a layer that carries context across people and tools.

The good news is that this isn’t a massive process overhaul. It starts with writing better inputs. A spec that includes constraints, decision history, and edge case behavior gives every agent run a better starting point, and each project that captures those decisions makes the next one faster.

The teams adopting spec-driven development now are building a compounding asset. Every documented decision, every logged constraint, every structured spec feeds into the next project. Six months from now, their agents are working from a rich, accurate picture of how the system works and why. That gap between teams who structure their intent and teams who don’t will only widen as agents take on more of the build.

The shift is small but the payoff is real: less time re-explaining the same things every sprint, less time auditing code that missed the point, and more time spent on the work that actually moves the product forward.

The Shift to Spec-Driven Development

2025-10-22T06:00:00-07:00

The Problem

AI is now writing real code, but it still has no idea what we actually want. It produces results with confidence, even when they are misaligned or just plain wrong. In production environments, that overconfidence turns small mistakes into costly problems and wasted cycles.

AI coding tools rely on what we give them as input. When we hand them basic requirements without structure or boundaries, they have to infer architecture, dependencies, and intent. And they often guess incorrectly.

As codebases grow, the problem compounds. Agents hallucinate APIs, misread structure, or fix one issue by breaking three others. Teams waste hours reviewing AI output, rewriting code, and patching misunderstandings that never should have happened.

This is not a tooling problem. It is a context problem.

The Solution

Spec-Driven Development gives AI and humans a shared language for intent.

Specifications become the primary artifact, and code becomes their expression. Each project acts as a container for features and knowledge, holding the entire context for that area of the product or platform.

A Living Project contains two core specs:

Requirements Spec – captures user intent, goals, acceptance criteria, and success metrics.
Tech Design Spec – maps product requirements to system design: APIs, data models, dependencies, integrations, and constraints.

Inside each project, features represent releasable slices of functionality broken into engineering tasks small enough to map to a single pull request.

Organizations maintain a system-wide platform spec, a living document representing the current state of the entire product, initially generated from deep codebase understanding and updated automatically as projects evolve.
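The hierarchy described above can be pictured as a small data model. The sketch below is an assumption about how such a structure might look, not Devplan's actual schema: the class names mirror the terms in the text, while the fields and the `platform_summary` roll-up are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    """A releasable slice; each task is meant to map to one pull request."""
    name: str
    tasks: list[str] = field(default_factory=list)


@dataclass
class Project:
    """A project holding the two core specs and its features."""
    name: str
    requirements_spec: str
    tech_design_spec: str
    features: list[Feature] = field(default_factory=list)


def platform_summary(projects: list[Project]) -> dict[str, int]:
    """Roll project-level data up into a simple platform-wide task count."""
    return {
        project.name: sum(len(feature.tasks) for feature in project.features)
        for project in projects
    }


projects = [
    Project(
        name="Checkout",
        requirements_spec="Users see a clear error on failed payment",
        tech_design_spec="Toast component, no auto-retry, Sentry logging",
        features=[Feature("Error states", ["Add toast", "Log to Sentry"])],
    )
]
print(platform_summary(projects))  # {'Checkout': 2}
```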

Core Principles

Specs as the Source of Truth – Functional and technical specs define the system. Code is their reflection.
Continuous Spec Integration – Specs evolve through updates, review, and versioning like code.
System Coherence – Project-level specs roll into a platform spec, keeping the product aligned.
Human Judgment, Machine Execution – Builders approve specs and guide direction; AI executes reliably.

Why This Matters

The software industry has reached a breaking point. Complexity has grown faster than our ability to manage it. Teams operate in fragmented systems with tickets in one tool, designs in another, specs in a third, code in a fourth, and AI sits awkwardly on top, trying to connect dots.

Spec-Driven Development creates a single layer of truth between human decision-making and AI execution, turning planning and implementation into a continuous, data-driven loop.

The Future of Building

A new contributor type is emerging: builders. They think like product managers and engineers but use different tools. Instead of handing off tickets, they define intent and guide AI systems to bring that intent to life in code.

As AI takes on more of the coding, builders spend their time on what humans are uniquely good at: understanding users, reasoning about systems, and making creative and strategic decisions. Specs become the shared language in that collaboration, and AI becomes the translation layer that turns ideas into software.

How Planning Impacts AI Coding

2025-08-07T07:00:00-07:00

Intro

The development community holds varying opinions on AI’s real-world engineering impact. Some report massive productivity improvements, while others find reviewing AI-written code slows them down. This experiment measured how proper planning affects AI-assisted coding productivity.

Experiment

We tested whether carefully prepared requirements at the feature level produce better results than quick hand-written prompts. The task, based on an open-source repository, was implemented twice by each agent: once with simple high-level requirements, once with detailed specifications.

Simple requirements included:

GitHub repository change analysis functionality
Automated periodic analysis for enrolled repositories
Persisted reports available through API
UI viewability

Detailed requirements covered implementation aspects, design patterns, and architecture decisions. All agents received guidance when stuck but no additional requirement information during implementation.

Criteria

Solutions were evaluated across four dimensions:

Correctness: Implementation alignment with proper design
Quality: Code maintainability and adherence to standards
Autonomy: How independently agents reached final solutions
Completeness: Satisfaction of explicit requirements

Scores ranged from 1 to 5. For parallel execution, consistent scores across all four dimensions matter more than a high score in any single one.

Results

Solution	Correctness	Quality	Autonomy	Completeness	Mean ± SD	Improvement
Claude, Short	2	3	5	3	3.75 ± 1.5	20%
Claude, Planned	4+	4	5	4+	4.5 ± 0.4	—
Cursor, Short	2-	2	5	3	3.4 ± 1.9	20%
Cursor, Planned	5-	4-	4	4+	4.1 ± 0.5	—
Junie, Short	1+	2	5	3	2.9 ± 1.6	34%
Junie, Planned	4	3	4+	—	3.9 ± 0.6	—

Key Observations

High-quality planning significantly improves correctness and quality. AI assistants need clearly prepared product and technical requirements to deliver intended results and follow guidelines.

Planning reduces score dispersion. Results became more consistent across all AI assistants with detailed, unambiguous requirements. Different agents often chose similar approaches, suggesting any capable coding assistant works well with proper specs.

Smaller tasks work more autonomously. Claude Code completed detailed requirements without nudging, while Cursor and Junie required additional guidance. Breaking work into smaller chunks increases autonomous completion probability.

Code reviews are major bottlenecks. Getting six AI runs near completion proved easier than reviewing two PRs. As AI coding scales, teams need larger features completed autonomously.

Recommendations for Parallel AI Execution

Prepare detailed specifications outlining scope, acceptance criteria, test coverage, database changes, and architectural decisions. Remove ambiguity ahead of time. AI handles code placement well but needs guardrails for production-ready output.
Keep execution right-sized. Tasks should complete autonomously without constant oversight. Purpose-built tools help generate appropriately scoped tasks for parallel execution across multiple agents.
Review every change. Even with proper planning, code rarely reaches production-ready status on first pass. Expect AI to reach approximately 80% completion, requiring manual refinement before merging.

Using Devplan in Practice

2025-08-07T06:00:00-07:00

This walkthrough explains how Devplan is used in real day-to-day development. More than 90% of the code shipped runs through Devplan, which makes it the foundation for executing quickly and getting the benefits of AI-enabled development.

The goals are to create a repeatable, scalable system where AI can:

Get to a working solution independently
Execute tasks in parallel
Require minimal human oversight

Without Devplan, the overhead of managing AI workflows can cancel out benefits. With it, the advantages are tremendous.

1. Define Product & Technical Specs with Devplan Agents

Projects start with Devplan’s agents helping define requirements. They ask clarifying questions, flag ambiguity, and scope work properly—grounded in codebase knowledge, past projects, and company structure.

This step is critical because good clarifying questions surface misalignments and hidden assumptions that would otherwise cause failures or multiple follow-ups. By the end, you have a clean, scoped project with resolved ambiguity.

2. Break the Project Down into Right-Sized Features

Devplan automatically breaks each project into individual features or user stories, with one prompt per feature.

Your job is light validation:

Are features correctly sized (ideally half-day to 5-day chunks)?
Are there too many or too few?
Do acceptance criteria make sense?

Thanks to planning in Step 1, this typically takes less than two minutes.
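The sizing question in that checklist is simple enough to automate as a sanity check. The sketch below assumes each feature carries a rough estimate in days, which is a hypothetical input. The bounds come from the half-day to 5-day guidance above.

```python
# Bounds taken from the guidance above (half a day to five days).
MIN_DAYS = 0.5
MAX_DAYS = 5.0


def sizing_issues(estimates: dict[str, float]) -> dict[str, str]:
    """Flag features whose estimate falls outside the suggested range."""
    issues = {}
    for feature, days in estimates.items():
        if days < MIN_DAYS:
            issues[feature] = "too small, consider merging"
        elif days > MAX_DAYS:
            issues[feature] = "too large, consider splitting"
    return issues


plan = {"Login form": 0.25, "Dashboard": 3, "Billing rewrite": 12}
print(sizing_issues(plan))
```

Features inside the range are left out of the result, so an empty dictionary means the breakdown passes this particular check.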

3. Run Prompts into Your AI IDE (Manual vs. Devplan CLI)

Once features and prompts are ready, run them in your IDE of choice—Claude, Cursor, Junie, etc.

Approach 1: Manual Execution

Per feature:

Download the generated prompt and format it for your IDE
Clone your repository or create a new worktree
Open your IDE manually in the correct folder
Prompt the AI to begin coding

Doing this 6–10 times per day becomes tedious, repetitive, and error-prone.
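Part of that repetition can be scripted even without extra tooling. As a rough sketch, the helper below builds a `git worktree add` command for each feature and prints it without running it. The branch and directory naming scheme is a hypothetical convention, and opening the IDE and loading the prompt would still be manual steps.

```python
import shlex


def worktree_command(repo_dir: str, feature_slug: str) -> list[str]:
    """Build (but do not run) a git command that creates a worktree for a feature."""
    branch = f"feature/{feature_slug}"   # hypothetical branch naming scheme
    path = f"../{feature_slug}"          # sibling folder, relative to repo_dir
    return ["git", "-C", repo_dir, "worktree", "add", "-b", branch, path]


for slug in ["error-states", "auth-refresh"]:
    print(shlex.join(worktree_command("my-app", slug)))
# git -C my-app worktree add -b feature/error-states ../error-states
# git -C my-app worktree add -b feature/auth-refresh ../auth-refresh
```

Returning the command as a list keeps it safe to pass to `subprocess.run` later without shell quoting issues.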

Approach 2 (recommended): Automated Execution with Devplan CLI

With Devplan CLI, overhead disappears. Spin up a feature-ready workspace with one command:

devplan clone -c XX -p YYYY -y -i cursor -f ZZZZ

This one-liner:

Creates a scoped cloned folder for the feature
Launches your IDE in correct context
Automatically references the correct prompt file

Then tell your AI agent: “Implement current feature.”

Before the CLI, time and energy were lost getting into features and switching between terminal, prompts, and IDEs. Parallel execution felt clunky, and small errors led to broken states. With the CLI, feature execution is fast, consistent, and repeatable—making scale possible.

4. Review and Polish the Output

This is the last human step before shipping. The amount of work drops dramatically if planning and prompting were done well.

Once the AI has written code:

Manually review the output
Fix issues or edge cases
Test to ensure it meets standards

Without this system, far fewer AI-generated features could be completed per day. Devplan turns isolated prompts into a real production workflow.

Devplan makes AI-assisted development planning 8–10x faster compared to manually managing specs, prompts, repos, and execution. Overall coding execution is 2-3x faster. More importantly, it makes the workflow scalable.

Requirements Adjustments

When an AI-coding agent goes sideways, it’s often easier to restart with corrected requirements. This workflow allows full restarts in minutes or seconds.

Go back to Step 1 and update the PRD or tech design doc. Then regenerate features and prompts with a single click in the Build Plan. Finally, use the CLI to restart with updated requirements—usually under 2 minutes total.

Centralizing requirements means every change persists, even if the repo is replaced or you switch AI IDEs. By contrast, changes made only in rule files won’t carry over to the next feature and may be lost if you switch tools.

Conclusion

Some articles suggest AI may be a net loss for productivity. Indeed, without smart usage or good tooling, this may be true. For professional engineers who are already efficient, minimizing overhead while empowering AI is critical. Every minute of overhead and context switch matters. Used well, AI can make engineers more productive and the job itself more enjoyable.

How to Use AI to Code for Beginners

2025-07-06T06:00:00-07:00

You’ve got an idea that’s been sitting in your notes or bouncing around your brain for months. You may have dabbled with code or played with ChatGPT, but haven’t yet built a complete, working product.

Whether you’re a designer, PM, marketer, ops lead, or curious professional, you don’t need to become a software engineer to build a working MVP. You just need a clear plan and the right tools.

This guide will help you:

Define what you’re building
Set up a modern web development environment
Understand how web apps work
Start coding with real tools
Deploy your project live
Know where to go when you get stuck

Step 0: Define What You’re Building First

Before coding, you need a plan. Devplan breaks this down into 3 parts:

1. PRD (Product Requirements Document)

Write what your product does from the user’s perspective. Focus on functionality, not implementation. Example:

“Users can sign up with email and password”
“They see a personalized dashboard after login”

2. Technical Design

Define what components, pages, logic, and data are needed to implement the PRD. This gives structure to your code before you write it.

3. Build Plan

Devplan turns your tech design into scoped tasks. Each task comes with a pre-written AI coding prompt. You’ll use these directly inside your IDE (Cursor is recommended).

After generating your plan:

Copy/paste the prompts into Cursor
Or download the plan and run through them as you go

Tech Stack Overview

Here’s the modern web dev stack:

Layer	Tool
Frontend framework	Next.js
UI styling	Tailwind CSS
Language	TypeScript
Runtime	Node.js
Package manager	npm
Deployment	Vercel
Editor	Cursor (AI-native IDE)
Version control	Git + GitHub

This is a professional-grade stack used by teams at real startups. You’re not building a toy app.

Step 1: Set Up Your Environment

Open your terminal. You’ll be using it often—it’s where most real development happens.

1. Install Node.js + npm

Go to https://nodejs.org. Download the package and install it as you would any other app.

Check that it worked:

node -v
npm -v

You should see version numbers.

2. Install Git

Download from https://git-scm.com/downloads. Install with default settings.

Check it:

git --version

Step 2: Install and Use Cursor

Cursor is a developer environment based on VS Code with built-in AI. It’s perfect for working with Devplan’s AI prompts and helping you through roadblocks.

Install it and open the app.

Inside Cursor:

Create or open your project folder
Use the built-in terminal (View > Terminal or Ctrl+`)
Use the AI agent panel (right-hand side) to paste in prompts or ask for help

Step 3: Create a New Project with Next.js

In the Cursor terminal, run the Next.js scaffold command, `npx create-next-app@latest`, and choose these options:

TypeScript → Yes
Tailwind CSS → Yes
App Router → Yes
Customize src directory → No
Import alias → No
Install dependencies → Yes

When complete:

cd your-app-name
npm run dev

What’s happening here?

You’re starting a local server on your machine
Your app runs at http://localhost:3000 — this is only visible to you
Next.js watches your files — every time you save, it auto-refreshes the browser

Open app/page.tsx, change some text, and save. Watch the browser update instantly.

Step 4: Get Comfortable with the Command Line

You’ll be using terminal commands a lot. Here are a few you’ll use regularly:

Command	Purpose
`cd folder-name`	Change directory
`ls` (or `dir` on Windows)	List files
`npm install`	Install dependencies
`npm run dev`	Start local dev server
`Ctrl+C`	Stop the current process
`clear`	Clean up the terminal screen

When things go wrong, most errors will show up in the terminal. Read it carefully—it usually tells you what broke.

Step 5: Start Building Features in Cursor

Once your Devplan Build Plan is ready:

Open Cursor and your project folder
Go to your Devplan task list
Copy the AI prompt for the first task
Paste it into Cursor’s AI panel
Cursor will generate code inside the file it thinks you need. Review it before saving
Use the dev server to check progress at http://localhost:3000

Repeat for each task in your plan.

Tips:

If something breaks or errors show up: copy the error text from the browser or terminal, paste it into Cursor’s AI panel, and say “I just added this code and now I’m getting this error. What’s wrong?”
Let the AI do some of the heavy lifting, but try to read and understand the code it writes.

Step 6: Common Errors and Fixes

Error	Cause	Fix
`Command not found: npx`	Node.js not installed properly	Reinstall Node.js, restart Cursor
`EADDRINUSE: Port 3000`	Dev server already running	Stop with `Ctrl+C`, try again
Red squiggles in Cursor	Lint/type errors	Hover and let Cursor suggest a fix
`Cannot find module`	Import path or file doesn’t exist	Double-check file names and paths
Tailwind styles don’t apply	Misconfigured setup	Restart dev server after config changes
Broken layout	CSS or HTML errors	Use devtools in browser to inspect
`npm install` errors	Conflicting dependencies	Delete `node_modules`, run `npm install` again
`ReferenceError`, `undefined`, etc.	JS bugs	Read stack trace, debug in browser console or Cursor agent
App crashes on build	Mismatched imports or component nesting	Use `console.log()` to trace it, ask Cursor to help debug
Confused by what a file is doing	Too much AI-generated code	Ask Cursor: “Explain what this file does and how it works”

Step 7: Deploy to Vercel

Once your app works locally, deploy it:

1. Push your project to GitHub

git init
git add .
git commit -m "initial commit"
gh repo create your-app-name --public --source=. --remote=origin --push

If you don’t have GitHub CLI (gh) installed, you can create a repo on the GitHub site and push it manually.

2. Deploy to Vercel

Go to https://vercel.com
Sign in with GitHub
Import your repo
Click Deploy

Vercel gives you a public URL in seconds. Push to GitHub again anytime you want to update the live site.

Summary: What You Just Set Up

Step	Outcome
Devplan	Structured PRD, Tech Design, and Build Plan with AI prompts
Cursor	AI-native IDE with terminal + prompt-based code generation
Next.js app	Full frontend app running locally at `localhost:3000`
Command line	Used to install, run, and debug
Vercel deploy	Live app online, ready to share

Final Notes

You will hit errors. That’s part of it. But now you have:

A plan (Devplan)
An assistant (Cursor)
A working stack (Next.js, Tailwind, Node, Git)
A feedback loop (localhost → fix → refresh → repeat)

Take it one feature at a time. Each small thing you ship teaches you how real software is built.

Outcome-Based Agile

2025-06-23T06:00:00-07:00

Most teams today operate in tension between agile process and business pressure. Customers want dates. Sales and marketing need launch timelines. Leadership needs coordination. And engineers just want clarity, not chaos.

But too often, teams are buried in ceremonies, documentation, and shifting priorities, while the core question of “what are we delivering, and when?” remains hard to answer. The result is misalignment, burnout, and wasted energy trying to connect team activity with business expectations. Traditional Agile tends to emphasize sprint plans, story points and velocity, but this is not the language of customers or the business. Outcome-Based Agile puts outcomes and deliverables at the center, making impact, not activity, the measure of success.

Outcome-Based Agile is a pragmatic delivery model, based on decades of experience at top companies, in which teams plan and deliver a meaningful outcome in a set timeframe. It gives the business predictability and gives teams the flexibility to build smart and tie their work directly to customer impact.

The most successful projects don’t wing it. They invest in planning upfront, then execute fast and clean. That’s the core principle here: define the outcome, shape the scope, then build with confidence while product, engineering, and go-to-market teams stay in sync.

What Outcome-Based Agile Is

In Outcome-Based Agile, the project is the unit of planning, not the sprint. Projects are the language of the business, and their impact is the outcome customers care about. This is why most major tech companies track projects, not sprints (even if a given team works in sprints). Projects make outcomes visible, provide containers for planning and tracking, and align cross-functional teams like sales, marketing, and support around a shared timeline. Teams still ship incrementally throughout the project behind feature flags, so they can test and validate early, but launches are what teams are driving toward. Smaller issues or standalone tickets still fit into the plan, but they are typically tracked independently of projects and have dedicated time allocated to them (e.g., 20% of time for non-project work).

Simply put, Outcome-Based Agile is:

A clear set of customer-aligned deliverables with target timeframes
A defined set of features and scope for each deliverable that can flex during execution
Continuous development, shipped incrementally and launched in alignment with the business

How It Works

Set the Outcome. Define the business impact. Example: “Increase activation by 15%.”
Explore the Problem Space. Evaluate different solutions for this problem space. Examples: “Streamline our onboarding process,” “Add demo mode,” or “Add inline tutorials.”
Plan the Scope. Write a PRD or product brief for the chosen solution(s). Create prototypes with AI to illustrate the solution in action. As a team, align on specific requirements, project scope, UX flows and technical approach. Break work into user stories, and create high-level estimates.
UX and Tech Design. The team creates and reviews the UX and tech designs. Refine the PRD, break user stories into tasks, update estimates, and prepare prompts for AI coding.
Build + QA Continuously. Engineers and AI agents execute together on scoped stories. Updates are shipped behind flags to test safely before launch.
Stakeholder Visibility. Real-time updates based on git check-ins and team demos show progress, risks, and trade-offs. Business stakeholders stay aligned, not surprised.
Launch + Measure. Launch the deliverable in coordination with marketing and sales. Track the impact and capture learnings.

Why It Works

Aligns delivery with business impact
Ships continuously, but launches intentionally
Keeps teams focused, autonomous, and outcome-driven

How Devplan Supports It

Devplan gives modern teams the structure to run Outcome-Based Agile with AI-native workflows:

Context-aware agents for PRDs and user story creation
Data-driven automated estimates and confidence scores with risk identification
Built-in prototype support with design guidance
Agent-guided technical design for key architecture decisions
Task breakdown optimized for AI agents
CLI-based developer workflow to pull detailed instructions into your IDE
Automated stakeholder updates (coming soon)

Outcome-Based Agile is already how high-functioning, established teams at top companies build and launch. We just gave it a name and supercharged it with AI inside of Devplan.

Introducing the Devplan CLI

2025-05-19T06:00:00-07:00

The Devplan CLI is your new command-line companion for bringing structured, AI-assisted product development directly into your workflow.

Whether you’re using Cursor or JetBrains Junie, the Devplan CLI gives you the power to bridge product planning and real code execution, right inside your AI-enabled IDE.

You already know the pain of jumping between docs, tickets, Slack threads, and your editor. The CLI eliminates that mess. Go from rough idea to production-ready feature in record time, with confidence that what you’re building is scoped, aligned, and complete.

What Is It?

The Devplan CLI is a command-line interface built in Go that connects to Devplan’s backend via secure, protobuf-based APIs. It pulls custom rules files along with detailed project requirements, test cases, architectural guidance, and edge cases built specifically for coding agents, and delivers the output directly into your IDE.

With the CLI, you can:

Fetch scoped work directly from your Devplan projects
Inject coding agent-ready instructions into your IDE of choice
Keep features and requirements in sync from product definition to engineering development

In short, it’s the orchestration layer between product thinking and code execution.

How Do You Use It?

Here’s how to use it:

Authenticate once, stay logged in securely
Select your IDE, company, project, and feature interactively
Sync with your Git repo and bring in context-aware tasks
Focus on a feature, and Devplan guides you with scoped instructions
Pull clean, AI-generated plans directly into your local dev workflow

Getting Started in 60 Seconds

1. Install the CLI

/bin/bash -c "$(curl -fsSL https://app.devplan.com/api/cli/install)"

2. Authenticate with Devplan

This sets up your credentials securely and gives the CLI access to Devplan projects.

3. Initialize in Your Repo

Navigate to your local repo, then you’ll be prompted via terminal UI to pick your company, project, feature, and IDE.

4. Pull Instructions into Your IDE

In your coding agent input, type the command to pick up and run the latest scoped feature instructions.

Full List of Commands

Command	Description
`auth`	Authenticate with Devplan to enable all other functionality securely.
`clean`	Clean up individual repositories from your Devplan workspace.
`clone`	Clone a repository and immediately focus on a feature for fast onboarding.
`focus`	Focus on a specific feature from your selected project.
`help`	View help content and usage examples for any command.
`self`	Display information about your currently authenticated user.
`update`	Update your CLI to the latest version with one command.
`version`	Print the current CLI version (useful for debugging or CI logs).
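Put together, a first session might look like the sketch below. It assumes the installer puts a binary named `devplan` on your PATH and that the interactive picker from step 3 is reached through `focus`. Neither detail is confirmed by this post, so check `help` for the exact usage. `devplan_first_run` is a made-up wrapper name.

```shell
# ASSUMPTION: the CLI binary is called `devplan`; only the subcommand
# names (version, auth, focus) come from the table above.
devplan_first_run() {
  if ! command -v devplan >/dev/null 2>&1; then
    echo "devplan not found on PATH; run the install command first" >&2
    return 1
  fi
  devplan version   # confirm the install worked
  devplan auth      # sign in once
  devplan focus     # pick a feature to work on
}
```

Run it from inside your local repo so the CLI can sync with the right project.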

Try It Now

If you’re already using Devplan, just run the install command above, then authenticate and initialize in your repository. You’ll be moving your next feature into production faster than ever.

We’ll be sharing more CLI tips, workflows, and advanced usage patterns soon!