# Background Agents and the Self-Driving Codebase

> An enterprise guide to the next era of software delivery.

Published by Ona (ona.com). This document contains the complete text content of background-agents.com and the accompanying white paper.

- Site URL: https://background-agents.com
- Publisher: Ona (https://ona.com)
- White paper PDF: https://background-agents.com/background-agents-white-paper.pdf
- White paper HTML: https://background-agents.com/background-agents-white-paper.html

---

## The Problem: Our Systems Are Bottlenecked by Localhost

Autocomplete became coding agents. Coding agents became three coding agents running in parallel on your laptop. Engineers reached for worktrees, more and more terminals, spare Mac Minis. Anything to run more agents.

But localhost wasn't built for this. Agents fight over machine state, secrets get exposed, and everything stops when the machine sleeps. That works for indie hackers, but it is untenable for professional engineering.

We must decouple engineers from workstations. Agents need to run in the background, securely, and at scale.

---

## The False Summit

Individual speed ≠ organizational velocity.

You rolled out coding agents. Engineers are faster. PRs flood in. Yet cycle time doesn't budge. DORA metrics are flat. The backlog grows.

The gains compound with the individual, not the organization. The longer you invest in coding agents without addressing the system around them, the deeper you entrench. This is the false summit.

---

## What Is a Background Agent

A coding agent needs your machine and your attention. A background agent needs neither.

It runs in its own development environment in the cloud: full toolchain, test suite, everything. Completely decoupled from your device and your session. Kick one off from your laptop, check the result from your phone. Trigger it from a PR, a Slack thread, a Linear ticket, a webhook, or just spin one up manually.

You're not steering it. You're not watching it.
It's an asynchronous task: delegate, walk away, review later. It runs for as long as it needs to.

### Coding Agent vs Background Agent Comparison

| Dimension | Coding Agent | Background Agent |
|---|---|---|
| Where they run | Your laptop or local machine | Cloud infrastructure, triggered remotely |
| How they're triggered | You invoke them manually | Events, schedules, Slack messages, API calls — any signal |
| Scope | Single task in one repo | Across repos, teams, and the full SDLC |
| Developer's role | In the loop — watching, steering, iterating | On the loop — prompt, walk away, review the output |
| Session lifetime | Tied to your machine and your session | Runs independently for as long as it needs |
| Governance | Inherits your credentials and access | Scoped permissions, audit trails, blast-radius controls |
| Coordination | One agent, one task | Fleets across hundreds of repos, swarms on complex tasks |

### Background Agents vs CI/CD Pipelines

CI/CD pipelines execute predefined steps — build, test, deploy. They don't make decisions or generate new code. Background agents are autonomous: they receive a trigger, reason about the problem, write code, run tests, and open a pull request.

A CI pipeline tells you a dependency is outdated. A background agent updates it, verifies nothing breaks, and submits the fix for review. A CI pipeline reports a test failure. A background agent investigates the failure, identifies the root cause, writes a fix, and opens a PR — before a developer is paged.

---

## Step 01: Establish Background Agent Primitives

Autonomous agents need infrastructure that doesn't exist on your laptop. The building blocks below are what separate a demo from a deployment — sandboxed execution, governance, connectivity to your internal systems, trigger automation, and fleet coordination. Each one unlocks the next.
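Before the primitives, the pipeline-versus-agent distinction above can be made concrete in a few lines: a CI step reports a fact, while a background agent acts on it and hands the result to human review. This is a minimal sketch under invented names; none of the functions or dictionaries below come from a real agent platform.

```python
# Hypothetical sketch: a CI step versus a background agent run.
# All names here are illustrative, not a real product API.

def ci_check(dep: dict) -> str:
    """A pipeline step: detect and report. No decisions, no new code."""
    return f"{dep['name']} {dep['current']} is outdated (latest: {dep['latest']})"

def run_tests(manifest: dict) -> bool:
    """Stand-in for running the project's real test suite."""
    return all(version for version in manifest.values())

def agent_run(dep: dict, workspace: dict) -> dict:
    """A background agent: apply the update, verify it, hand off for review."""
    workspace["manifest"][dep["name"]] = dep["latest"]   # write the change
    if not run_tests(workspace["manifest"]):             # verify nothing breaks
        return {"status": "escalated", "reason": "tests failed"}
    return {
        "status": "pr_opened",                           # human review gate
        "title": f"chore: bump {dep['name']} to {dep['latest']}",
    }

dep = {"name": "requests", "current": "2.28.0", "latest": "2.32.3"}
workspace = {"manifest": {"requests": "2.28.0"}}
print(ci_check(dep))               # the pipeline stops at a warning
print(agent_run(dep, workspace))   # the agent ends with a reviewable PR
```

The structural difference is the terminal state: the pipeline ends with a message, while the agent ends with a change that has already been verified and is queued for a human decision.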
### Primitive: Sandboxed Execution — The Agent Needs a Computer

Agents running in the background need their own execution environment: a full toolchain, the ability to run tests, and secrets-based access to the systems they depend on. Environments should be isolated and reproducible, with close parity to production, so that fleets of agents can run side by side. Everything else builds on top of this.

**Why it matters.** Without isolation, one agent's failure can cascade into another's work. Without a full toolchain, the agent can't run the same commands your engineers run. Without reproducibility, you can't trust the output.

**What happens without it.** Agents share machine state, fight over ports and file locks, and produce results that work on one machine but not another. Security teams veto autonomous agents because there's no blast-radius control.

**What it looks like in practice.** Each agent gets its own development environment — a full VM or container with the complete toolchain, test suite, and build system. The environment spins up on demand, runs the task, and tears down when done. Hundreds can run in parallel without interference.

**The architecture question.** Two patterns are emerging: centralized (all agent environments run in a shared cloud platform) and distributed (agent environments run inside the customer's own infrastructure). The choice depends on security requirements, data residency, and how much of the stack you want to own.

### Primitive: Governance — Enforced at Runtime, Not by Prompt

Agents are actors in your system. They need the same controls as human contributors — identity, permissions, audit trails. The difference: governance enforced by a system prompt ("please don't delete files") is a suggestion. Governance enforced at the execution layer — deny lists, scoped credentials, deterministic command blocking — is actual governance. Without it, security teams veto autonomous agents entirely. And they're right to.
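As a minimal sketch of what execution-layer enforcement can look like, the snippet below implements a deterministic command filter with an audit trail in plain Python. The deny list, allow prefixes, and log shape are hypothetical illustrations, not any particular product's mechanism.

```python
# Hypothetical sketch: governance enforced by code, not by prompt.
import shlex

DENY_COMMANDS = {"rm", "dd", "mkfs", "shutdown"}        # destructive commands
ALLOW_PREFIXES = [["git"], ["pytest"], ["npm", "test"]]  # least-privilege allow list

audit_log: list[dict] = []

def authorize(command: str, agent_id: str) -> bool:
    """Deterministically allow or block a command, and log the decision."""
    argv = shlex.split(command)
    allowed = bool(argv) and argv[0] not in DENY_COMMANDS and any(
        argv[: len(prefix)] == prefix for prefix in ALLOW_PREFIXES
    )
    audit_log.append({"agent": agent_id, "command": command, "allowed": allowed})
    return allowed

print(authorize("git push origin main", "agent-42"))  # allowed by prefix
print(authorize("rm -rf /", "agent-42"))              # blocked by deny list
```

Because the check is ordinary code rather than an instruction in a prompt, an agent cannot talk its way around it, and every decision lands in the audit trail whether it was allowed or not.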
**Why it matters.** An ungoverned agent with broad credentials is a liability, not an asset.

**What happens without it.** No audit trail for what agents did and why. No way to scope access per task. No way to enforce policies consistently across agent runs. Every agent has the same access as the developer who triggered it — which is usually far more access than the task requires.

**What it looks like in practice.** Each agent run has a scoped identity with least-privilege credentials. Command deny lists prevent destructive operations. Every action is logged to an audit trail. Human review gates (typically pull requests) sit between agent output and production. Blast-radius controls ensure a single failure can't cascade.

### Primitive: Context & Connectivity — Behind Your Firewall

A sandbox that can't reach your internal systems is a toy. Agents need to assume IAM roles, query database replicas, hit internal APIs, and pull from private registries — all from inside your network. Context and connectivity turn isolated execution into real work.

**Why it matters.** Enterprise codebases don't exist in isolation. They depend on internal services, private packages, proprietary build tools, and data that lives behind the firewall. An agent that can't reach these systems can't do real work.

**What happens without it.** Agents produce code that compiles but doesn't integrate. They can't run integration tests, can't access the real dependency graph, and can't validate against production-like data. The output requires manual rework to actually ship.

**What it looks like in practice.** Agent environments run inside the customer's network or have secure tunnels to internal systems. They can assume IAM roles, access private registries, query database replicas, and use internal CLIs — the same access a developer's machine has, but scoped and audited.
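One way to picture "scoped and audited" access is short-lived, least-privilege credentials minted per agent run rather than inherited from the triggering developer. The sketch below is purely illustrative: the function names, scope strings, and token format are invented for this example.

```python
# Hypothetical sketch: per-run scoped credentials for an agent environment.
import secrets
import time

def mint_credentials(agent_id: str, scopes: list[str], ttl_s: int = 3600) -> dict:
    """Issue short-lived, least-privilege credentials for one agent run."""
    return {
        "token": secrets.token_hex(16),          # opaque per-run token
        "agent": agent_id,
        "scopes": scopes,                        # granted per task, not per person
        "expires_at": time.time() + ttl_s,       # bounded lifetime
    }

def can_access(creds: dict, scope: str) -> bool:
    """Check one scope against the credential's grants and expiry."""
    return scope in creds["scopes"] and time.time() < creds["expires_at"]

creds = mint_credentials("agent-7", ["registry:read", "db-replica:read"])
print(can_access(creds, "registry:read"))      # granted for this run
print(can_access(creds, "db-replica:write"))   # replica access is read-only
```

Scopes are granted per task rather than inherited, and the TTL bounds how long a leaked token stays useful — both of which shrink the blast radius of any single run.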
### Primitive: Triggers — Remove the Human from the Invocation Loop

If every agent run starts with a developer typing a prompt, you haven't automated the workflow — just the work. Triggers connect agents to the events that matter: schedules, webhooks, system signals. Each pattern maps to a different scope and cadence.

**Scheduled agents.** Triggered on a timer. Predictable, bounded, high-volume — dependency updates, lint sweeps, coverage enforcement.

**Event-driven agents.** Triggered by system events — a PR opened, a CVE published, an alert fired. Reactive, concurrent, always listening.

**Agent fleets.** One task across many repositories. Each agent works independently and produces its own contribution.

**Agent swarms.** Many agents, one outcome. Every agent works on a different facet, and the results converge into a single deliverable.

**Mobile triggers.** Trigger one agent, or many, directly from your phone or iMessage. One text, and a fleet fans out.

### Primitive: Fleet Coordination — One Intent, Every Repo

Updating one repository is a coding agent task. Updating 500 is a fleet task. The same sandbox, replicated across every repository that needs the change — parallel provisioning, progress tracking, aggregated results. This is where individual productivity becomes organizational throughput.

**Why it matters.** A single agent updating a single repo is useful. A fleet updating every repo in the organization simultaneously is transformative.

**What happens without it.** Large-scale changes require manual invocation per repo, manual tracking of progress, and manual aggregation of results. The "500 repo update" becomes a multi-week project instead of a multi-hour operation.

**What it looks like in practice.** One intent ("update this dependency across all repos") triggers parallel agent runs — one per repo. Each agent works independently in its own environment. Progress is tracked centrally. Results are aggregated into a dashboard.
Failed runs are retried or escalated automatically.

---

## Step 02: Find Your System's Bottlenecks

The primitives give you the capability. Where you apply them is what matters. That means doing the work: surveying your developers, sitting with your teams, mapping where time goes. Every organization's bottlenecks are different. The ones worth solving first aren't always obvious.

### Seven High-Value Starting Points

1. **Code reviews that pile up faster than ever.** PRs sit for hours while you context-switch. At scale, review queues back up and lead time stays flat despite faster coding. A background agent reviews every PR before a human sees it, so reviewers focus on design, not formatting.
2. **CI failures nobody has time to investigate.** Your build breaks and you spend 20 minutes reading logs to find the one flaky test. At scale, flaky tests waste hundreds of re-runs per week. A background agent triages failures, identifies flaky tests, and opens fix PRs.
3. **Merge conflicts that pile up between branches.** You pull main and spend 30 minutes resolving conflicts before you can continue. With multiple teams on the same codebase, it happens daily. A background agent resolves the mechanical conflicts and flags the rest.
4. **CVEs that sit unpatched for weeks.** You see a GitHub advisory and add it to your list. At scale, the same patch needs applying across dozens of repos and time-to-fix is measured in sprints. A background agent applies the fix, runs tests, and opens PRs across every affected repo.
5. **Test coverage that never gets written.** You know parts of your codebase have no tests but never find time to add them. Legacy code ships with zero coverage because there's no safety net. A background agent generates tests for uncovered paths and validates them against existing behavior.
6. **Code standards that vary from repo to repo.** You forget to run the linter before pushing. Across teams, standards vary between repos and enforcement is patchy.
A background agent checks every PR against your standards and auto-fixes the straightforward violations.
7. **Release notes that nobody writes.** You tag a release and try to remember what changed. At scale, release notes are either skipped or someone spends half a day reading merged PRs. A background agent compiles changes, groups them by type, and drafts notes ready for review.

**The principle:** Start with the most boring work, not the most impressive. The highest-value automation targets are repetitive, well-defined tasks with bounded blast radius.

---

## Step 03: Scale Your Software Factory

The engineering organization is an industrial system. Today, developers stand at every station: writing, reviewing, testing. Background agents change the operating model. The factory runs, but your engineers move *on* the loop instead of *in* it.

### The Checklist — What "Done" Looks Like

- Every PR is reviewed by an agent before a human sees it
- CI failures are investigated and fixed before a developer is paged
- Developers never manually resolve merge conflicts on agent PRs
- Agents do the first-pass investigation into every production incident
- Security vulnerabilities are patched within hours, not sprints

---

## Towards a Self-Driving Codebase

Your engineers aren't in the loop. They're on the loop. The factory floor is running. Code is being written, reviewed, tested, deployed — continuously, autonomously. Your engineers are observing. Setting constraints. Verifying outcomes.

This is the destination: a codebase that drives itself. Not a codebase without humans — a codebase where humans do different work. Instead of writing every line, they're designing agent workflows. Instead of reviewing every PR manually, they're setting the constraints that agents enforce. Instead of investigating every CI failure, they're improving the system that investigates for them.

The shift is from imperative to declarative.
You stop telling agents what to do and start declaring what should be true:

- Dependency versions should always be current.
- Test coverage should never drop below the threshold.
- Security vulnerabilities should be patched within 24 hours.
- Every PR should be reviewed before a human sees it.

You declare the desired state. Agents ensure it. Continuously, across every repo, without being asked.

The organizations that get there won't be the ones with the best coding agents. They'll be the ones that redesigned the system around autonomous execution — the ones that stopped optimizing individual stations and started building the factory.

---

## Frequently Asked Questions

**Can I just run coding agents in the background on my laptop?**

You can, and many teams do — multiple terminals, git worktrees, maybe a Mac Mini humming in the corner. But that's agents running in the background, not background agents. There's no event triggering, no governance, no audit trail, and everything stops when the machine sleeps. The gap between those two things is where most organizations stall.

**What infrastructure do background agents need?**

Three things that don't exist on your laptop: isolated compute environments that spin up on demand, an event system that routes triggers (PR opened, CVE published, Slack message, cron schedule) to the right agent, and a governance layer for permissions, audit trails, and blast-radius controls.

**Are background agents safe?**

They're as safe as the infrastructure you put around them. The key controls are sandboxed environments with scoped access, human review gates on all output (typically via pull requests), audit trails for every action, and a bounded blast radius so a single failure can't cascade. The risk isn't the agent — it's running agents without these controls.

**Do background agents replace developers?**

No. They shift what developers do.
Instead of writing every line, developers move "on the loop" — reviewing agent output, calibrating behavior, designing systems, and focusing on work that requires judgment. You still look at the code. You still own the decisions. The difference is you're reviewing work, not doing it keystroke by keystroke.

**What are common use cases for background agents?**

The highest-value starting points are repetitive, well-defined tasks with bounded blast radius: dependency updates across hundreds of repos, CVE remediation within hours of disclosure, CI pipeline migrations, lint and standards enforcement, test coverage expansion, and code review triage. Start with the workflows costing the most time, money, or risk.

**How are background agents different from CI/CD pipelines?**

CI/CD pipelines execute predefined steps — build, test, deploy. They don't make decisions or generate new code. Background agents are autonomous: they receive a trigger, reason about the problem, write code, run tests, and open a pull request. A CI pipeline tells you a dependency is outdated. A background agent updates it, verifies nothing breaks, and submits the fix for review.

**How do we measure ROI on background agent infrastructure?**

Start with time-to-resolution metrics on the tasks you automate. If CVE patching took two weeks and now takes two hours, that's measurable. If code review turnaround dropped from 8 hours to 20 minutes for the first pass, that's measurable. Aggregate across the number of repos and the frequency of the task. The ROI compounds with scale — updating one repo saves minutes; updating 500 saves weeks.

**What's the change management path for engineering teams?**

Start with the most boring work. Engineers won't resist automating dependency updates or lint fixes — they'll be relieved. Build trust with low-risk, high-volume tasks before expanding to more complex workflows. Expect that developers will over-review agent output for the first two weeks, then under-review it.
Build review habits early. The teams that succeed treat agent output like junior developer output: review everything, trust incrementally.

---

## Case Study Teardowns

### Stripe — Minions

Stripe's Minions platform is one of the most publicly documented background agent deployments. Built on custom infrastructure inside Stripe's network, Minions are one-shot, end-to-end coding agents that receive a task, execute it autonomously, and produce a pull request.

Key architectural decisions:

- Custom agent framework on top of foundation models, with multiple-model support and task-specific agent profiles
- Full development environments on internal cloud infrastructure — each agent gets an isolated VM with the complete Stripe toolchain
- Internal system access — agents run inside Stripe's network with access to internal build systems, test runners, linters, and deployment tools
- Human review gates — all agent output goes through the standard PR review process

What's transferable: The pattern of giving agents full development environments (not just code-generation sandboxes) with access to the real toolchain. The insistence on human review gates for all output. The investment in task-specific agent profiles rather than one-size-fits-all prompting.

### Ramp

Ramp published a detailed account of why they built their own background agent infrastructure rather than using off-the-shelf tools. Their reasoning centered on the gap between what coding agents could do and what their engineering organization needed.

Key architectural decisions:

- Purpose-built for their SDLC — the agent infrastructure was designed around Ramp's specific workflows, not generic coding tasks
- Event-driven triggers — agents respond to system events (PRs, deployments, alerts) rather than human prompts
- Integrated with internal systems — agents have access to Ramp's internal APIs, databases, and tooling

What's transferable: The decision framework for build-vs-buy.
The recognition that generic coding agents don't address organizational bottlenecks. The focus on event-driven triggers as the key differentiator from "coding agents running in the background."

### Spotify

Spotify's engineering team published a three-part series documenting their background coding agent deployment — over 1,500 PRs generated by agents running autonomously.

Key architectural decisions:

- Context engineering — extensive investment in wiring organizational knowledge (style guides, ADRs, internal APIs, past reviews) into agent workflows
- Strong feedback loops — systematic evaluation of agent output quality, with metrics driving continuous improvement of agent behavior
- Progressive rollout — started with low-risk, high-volume tasks and expanded scope as confidence grew

What's transferable: The emphasis on context engineering as the differentiator between generic output and output that fits your codebase. The feedback-loop methodology — treating agent quality as a metric to optimize, not a binary pass/fail. The progressive rollout strategy that builds organizational trust incrementally.

---

## Further Reading

1. [1,500+ PRs Later: Spotify's Journey with Our Background Coding Agent](https://engineering.atspotify.com/2025/11/spotifys-background-coding-agent-part-1) — Spotify
2. [Minions: Stripe's One-Shot, End-to-End Coding Agents](https://stripe.dev/blog/minions-stripes-one-shot-end-to-end-coding-agents) — Stripe
3. [Why We Built Our Background Agent](https://builders.ramp.com/post/why-we-built-our-background-agent) — Ramp
4. [How Uber Uses AI for Development: Inside Look](https://newsletter.pragmaticengineer.com/p/how-uber-uses-ai-for-development) — Pragmatic Engineer
5. [Harness Engineering: Leveraging Codex in an Agent-First World](https://openai.com/index/harness-engineering/) — OpenAI
6. [Towards Self-Driving Codebases](https://cursor.com/blog/self-driving-codebases) — Cursor
7. [How StrongDM's AI team build serious software without even looking at the code](https://simonwillison.net/2026/Feb/7/software-factory/) — Simon Willison
8. [Time Between Disengagements: The Rise of the Software Conductor](https://ona.com/stories/time-between-disengagements-the-rise-of-the-software-conductor) — Ona
9. [Ramp Inspect: Full Background Agent Architecture Diagram](https://x.com/rahulgs/status/2016284233438793793) — Ramp
10. [How Ramp Built a Background Coding Agent That Writes Over Half of Its Pull Requests](https://www.youtube.com/watch?v=ii9E5z8Gsrc) — Ramp
11. [How Spotify Built Honk: From Backstage to Agentic Coding at Scale](https://backstage.spotify.com/how-spotify-built-honk) — Spotify
12. [Background Coding Agents: Context Engineering](https://engineering.atspotify.com/2025/11/context-engineering-background-coding-agents-part-2) — Spotify
13. [Background Coding Agents: Predictable Results Through Strong Feedback Loops](https://engineering.atspotify.com/2025/12/feedback-loops-background-coding-agents-part-3) — Spotify
14. [Minions: Stripe's One-Shot, End-to-End Coding Agents — Part 2](https://stripe.dev/blog/minions-stripes-one-shot-end-to-end-coding-agents-part-2) — Stripe
15. [Expanding Our Long-Running Agents Research Preview](https://cursor.com/blog/long-running-agents) — Cursor
16. [From Craft to Mass Production: Software as an Industrial System](https://ona.com/stories/industrializing-software-development) — Ona
17. [The Two Patterns by Which Agents Connect Sandboxes](https://x.com/hwchase17/status/2021261552222158955) — LangChain

---

## About Ona

Ona (formerly Gitpod) builds infrastructure for background agents. Learn more at https://ona.com.