[Figure: 19 AI agent squads coordinating through shared artifacts, GitHub issues, and memory files]
Engineering

AI Agent Orchestration at Scale: How 19 Squads Actually Coordinate

By Agents Squads · 13 min

TL;DR — We run 19 AI agent squads in production. The orchestration layer isn’t a framework or a control plane — it’s a set of conventions: GitHub issues as work queues, markdown files as memory, PRs as handoff artifacts. This post is the mechanics behind our operational story. Not what we built — how it actually coordinates.

The Problem With “Multi-Agent” Framing

Most writing about multi-agent systems focuses on the agent graph: who calls whom, how outputs chain, what the orchestrator decides. That framing assumes there’s a central nervous system doing the routing.

We don’t have one.

What we have is closer to a distributed organization than a pipeline. Nineteen squads, each owning a domain — engineering, finance, marketing, intelligence, and more — running independently, producing outputs that other squads consume. The coordination isn’t centralized. It emerges from shared conventions about where to put things and how to signal completion.

Understanding why we designed it this way requires understanding the failure mode we were avoiding: orchestrator bottlenecks. When every agent interaction routes through a single orchestrator, that orchestrator becomes the constraint. It needs full context about every domain, burns tokens on routing decisions that don’t require intelligence, and becomes a single point of failure. We opted out of that architecture early.

The trade-off is that you need explicit coordination primitives. Agents don’t talk directly — they write to shared surfaces that other agents read.

The Three Coordination Surfaces

Everything in our orchestration system passes through one of three surfaces.

1. GitHub Issues as Work Queues

When a squad needs another squad to do something, it creates a GitHub issue. That’s the protocol. The finance squad doesn’t call the data squad — it creates an issue in the appropriate repo tagged with priority:P1 and the data squad’s agent picks it up on the next run.
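
As a concrete illustration, here is a minimal sketch of that handoff using the GitHub API via Octokit. The org name, repo name, and issue body format are assumptions for the example; only the priority:P1 label convention comes from how we actually work.

import { Octokit } from "@octokit/rest";

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

// A finance-squad agent asking the data squad for work. The issue, not a
// direct call, is the protocol: the data squad's agents pick it up on their
// next run by filtering for the priority:P1 label.
async function requestWork(): Promise<void> {
  await octokit.rest.issues.create({
    owner: "example-org",                  // hypothetical org
    repo: "data",                          // the owning squad's repo (assumed name)
    title: "[Finance] Refresh revenue rollup before monthly close",
    body: "Requested by: finance squad. Context lives in the squad memory file.",
    labels: ["priority:P1"],               // the label the consuming squad polls for
  });
}

requestWork().catch(console.error);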

The issue-as-protocol approach has several properties we didn’t fully appreciate until we’d been running it for three months:

It’s asynchronous by default. No squad blocks waiting for another squad’s output. The requesting squad creates the issue and moves on. The work gets done when it gets done.

It’s visible. Every cross-squad dependency is tracked in GitHub. No implicit data flows, no undocumented handoffs. When something breaks, the issue trail shows exactly what was requested and when.

It’s durable. If an agent run fails midway, the issue still exists. The next run picks up where the last one left off. We don’t lose work because an agent timed out.

The failure mode is issue proliferation. We’ve had periods where the queues grow faster than agents drain them — especially when scan-type agents are running frequently. The discipline that helps is tagging issues properly and having lead agents triage priority before dispatching workers.
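
A rough sketch of what that triage pass might look like from a lead agent's side, again via Octokit. The tiered label scheme beyond priority:P1 is an assumption for the example.

import { Octokit } from "@octokit/rest";

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

// Drain the queue in priority order, oldest first, before dispatching workers.
async function triage(owner: string, repo: string): Promise<void> {
  const tiers = ["priority:P1", "priority:P2", "priority:P3"]; // assumed label tiers
  for (const label of tiers) {
    const { data: issues } = await octokit.rest.issues.listForRepo({
      owner,
      repo,
      state: "open",
      labels: label,
      sort: "created",
      direction: "asc",
    });
    for (const issue of issues) {
      console.log(`[${label}] #${issue.number} ${issue.title}`);
      // dispatchWorker(issue) would go here
    }
  }
}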

2. Markdown Files as Memory

Each squad maintains persistent memory in the hq repository. Squad-level memory captures what the squad has learned. Agent-level memory captures what specific agents have learned. State files capture what’s happening right now — current tasks, recent outputs, blockers.

The memory architecture is intentionally simple: plain markdown files, version-controlled in git. No vector stores, no graph databases. When an agent starts a run, it reads its state file and squad memory. When it finishes, it updates them.
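
As a sketch of that loop, assuming a layout like hq/squads/&lt;squad&gt;/memory.md plus a per-agent state file (the exact paths, headings, and the doWork placeholder are illustrative, not our actual repo structure):

import { readFile, writeFile } from "node:fs/promises";

// Read context at the start of a run, write back what changed at the end.
// doWork stands in for the actual agent invocation.
async function runWithMemory(
  squad: string,
  agent: string,
  doWork: (context: string) => Promise<string>,
): Promise<void> {
  const memoryPath = `hq/squads/${squad}/memory.md`;               // assumed path
  const statePath = `hq/squads/${squad}/agents/${agent}/state.md`; // assumed path

  // 1. Load squad memory and the agent's own state before doing anything.
  const memory = await readFile(memoryPath, "utf8").catch(() => "");
  const state = await readFile(statePath, "utf8").catch(() => "");

  // 2. Run with that context available to the agent.
  const summary = await doWork(`${memory}\n\n${state}`);

  // 3. Persist: overwrite the state snapshot, append to squad memory.
  const stamp = new Date().toISOString();
  await writeFile(statePath, `# Current state (${stamp})\n\n${summary}\n`);
  await writeFile(memoryPath, `${memory}\n\n## Run ${stamp}\n\n${summary}\n`);
}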

This read-then-update loop produces a genuinely useful property: agents improve over time without retraining. Our intelligence squad’s monitor agents have accumulated months of competitive intelligence. When they analyze a new entrant to the market, they’re comparing against a knowledge base they’ve been building since November. A fresh agent would see an isolated data point. These agents see a pattern.

The failure mode we’ve hit: state file staleness. State files are meant to capture ephemeral current state, but they accumulate. Right now, roughly 141 of 147 agent state files are stale — older than seven days. When an agent reads its state file and finds month-old “current task” data, it either ignores it or, worse, acts on it. We’ve built a rotation script that prunes stale entries, but enforcing it at scale is an ongoing operational challenge.
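
A simplified version of that rotation pass might look like the sketch below. The seven-day threshold matches the staleness definition above; the directory layout and archive destination are assumptions.

import { mkdir, readdir, rename, stat } from "node:fs/promises";
import { join } from "node:path";

const MAX_AGE_MS = 7 * 24 * 60 * 60 * 1000; // the seven-day staleness threshold

// Move stale state files out of the agents' read path instead of deleting them,
// so history stays recoverable from the archive (and from git).
async function rotateStaleState(stateDir: string, archiveDir: string): Promise<void> {
  await mkdir(archiveDir, { recursive: true });
  const now = Date.now();
  for (const file of await readdir(stateDir)) {
    if (!file.endsWith(".md")) continue;
    const path = join(stateDir, file);
    const { mtimeMs } = await stat(path);
    if (now - mtimeMs > MAX_AGE_MS) {
      await rename(path, join(archiveDir, file));
    }
  }
}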

The deeper lesson: memory architecture needs explicit lifecycle management. Writing is easy. Knowing when to forget is hard.

3. Pull Requests as Handoff Artifacts

When a squad produces code, content, or configuration, it creates a PR. That’s the output artifact. Other squads don’t get a direct data transfer — they get a PR they can review, merge, or reject.

This matters for quality. Our content-worker agent doesn’t push directly to the website — it creates a PR that can be reviewed before it goes live. Our engineering squad’s issue-solver agents create PRs that go through the same review process as human-authored code. The PR is the quality gate.

It also creates a natural audit trail. Every agent output that touches production passes through a PR. If something breaks, the PR history shows exactly what changed, who (or what) changed it, and why.
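
For illustration, here is roughly what the handoff looks like from the worker's side, using Octokit to open the PR. The org, repo, branch handling, and title are placeholders, not our actual pipeline.

import { Octokit } from "@octokit/rest";

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

// The worker has already committed its output to a branch; the PR is the handoff.
async function handOff(branch: string): Promise<void> {
  await octokit.rest.pulls.create({
    owner: "example-org",        // hypothetical org
    repo: "website",             // assumed target repo
    title: "[Marketing] Update competitive comparison page",
    head: branch,                // the agent's working branch
    base: "main",
    body: "Produced by content-worker. Human review before merge is the quality gate.",
  });
}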

How Squads Get Triggered

Agents in our system run on three trigger types. Choosing the right trigger for each kind of work is one of the more important architectural decisions we make when designing a new squad.

Event-driven (issue_labeled, pr_created, push): The most common trigger for worker agents. When a GitHub issue gets labeled priority:P1, relevant worker agents wake up. When a PR is created in a monitored repo, review agents run. This is the right trigger for reactive work — when something happens, respond to it.

Scheduled (cron): Used for monitoring and scan agents that need to run periodically regardless of explicit triggers. Our operations scanner checks infrastructure health daily. Our content scanner checks the publishing calendar weekly. The failure mode here is schedule-driven busywork — agents running on schedule but producing noise because there’s nothing actionable to work on. We’ve moved most of our scheduled work to longer intervals after discovering this.

Manual (squads run): Used for strategic work that a human decides to kick off. The cofounder runs a squad manually when a specific deliverable is needed — a competitive analysis, a financial report, a content sprint. This keeps humans in the decision loop for high-judgment work while delegating execution.
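
One way to picture the distinction is as a trigger field on each agent definition. The shape below is an illustrative TypeScript sketch, not the actual squads-cli schema; the event names come from the triggers listed above, and the cron expression is an example.

// One trigger per agent definition, modeled as a discriminated union.
type Trigger =
  | { kind: "event"; on: "issue_labeled" | "pr_created" | "push"; label?: string }
  | { kind: "schedule"; cron: string }
  | { kind: "manual" };

interface AgentDefinition {
  name: string;
  squad: string;
  trigger: Trigger;
}

// Reactive work: wake up when an issue gets the priority:P1 label.
const issueSolver: AgentDefinition = {
  name: "issue-solver",
  squad: "engineering",
  trigger: { kind: "event", on: "issue_labeled", label: "priority:P1" },
};

// Periodic work: the daily infrastructure health check.
const opsScanner: AgentDefinition = {
  name: "operations-scanner",
  squad: "operations",
  trigger: { kind: "schedule", cron: "0 6 * * *" }, // example schedule
};

// High-judgment work: a human kicks it off with `squads run`.
const competitiveAnalysis: AgentDefinition = {
  name: "competitive-analysis",
  squad: "intelligence",
  trigger: { kind: "manual" },
};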

The orchestration patterns post covers the theoretical static-vs-dynamic split. What we’ve found in practice: event-driven beats scheduled every time for output quality. Agents triggered by real events have real work to do. Agents running on schedules often find nothing worth doing and produce low-signal output to justify the run.

Cross-Squad Coordination in Practice

Here’s how a real workflow crosses squad boundaries.

The intelligence squad’s market monitor flags a competitor shipping a new feature. It writes that finding to its domain memory file and creates a GitHub issue: “[Intel] Competitor X launched agent memory sync — assess implications.” The issue gets labeled priority:P1.

The intelligence lead picks up the issue, reads the domain memory for context, and dispatches worker agents: one to research the feature, one to assess competitive positioning, one to check if we have a comparable capability.

The research output goes into a report in the research repo. The competitive assessment updates the intelligence squad’s memory. A new GitHub issue gets created in the marketing repo: “[Intel] Update competitive comparison — competitor X now has memory sync.”

The marketing squad’s content worker picks up that issue, reads the competitive positioning notes, and either updates existing content or drafts a new piece. That goes out as a PR to the website repo.

Total humans involved: zero (until someone reviews the marketing PR). Total squads involved: three (intelligence, research, marketing). Total time from signal to PR: hours, depending on queue depth.

This is what multi-agent coordination looks like when it works. The agent handoffs happen through shared surfaces, not direct calls. Each squad does its domain work. The coordination emerges from conventions, not a central planner.

The Failure Modes We Didn’t See Coming

We’ve covered what works. Here’s what broke.

Cross-service dependency failures. The most expensive lesson: we shipped a “Sign In” button on the website while the API had unapplied database migrations and the console was nine days stale. Each squad’s work looked complete in isolation. From the user’s perspective, nothing worked. We now require any user-facing feature to document all dependent services and confirm each is deployed. Agents can’t see across service boundaries unless you explicitly build that check.
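
The check we added amounts to something like the sketch below: enumerate the services a user-facing feature depends on and refuse to ship unless every one responds healthy. The service list and health endpoints here are hypothetical.

// Dependent services for a hypothetical user-facing feature.
const dependencies = [
  { name: "website", url: "https://example.com/health" },
  { name: "api", url: "https://api.example.com/health" },
  { name: "console", url: "https://console.example.com/health" },
];

// Refuse to ship unless every dependent service responds healthy.
async function checkDependencies(): Promise<void> {
  const failures: string[] = [];
  for (const dep of dependencies) {
    try {
      const res = await fetch(dep.url);
      if (!res.ok) failures.push(`${dep.name}: HTTP ${res.status}`);
    } catch {
      failures.push(`${dep.name}: unreachable`);
    }
  }
  if (failures.length > 0) {
    throw new Error(`Do not ship: ${failures.join(", ")}`);
  }
}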

The quality vs. activity measurement gap. Our eval system measures agent activity — how many runs, how many issues closed, how many PRs created. It doesn’t measure whether the outputs were good. We’ve had squads produce high activity scores while the actual work was mediocre. Context engineering helps keep agents sharp, but without quality gates, activity metrics are misleading. We’re still working on this.

Escalation without resolution. Our agents can flag blockers. They’re less good at knowing when they’ve done something wrong. When an agent produces bad output and creates a PR, it looks like success in the metrics. Only human review catches the problem. For high-volume pipelines, you need explicit quality checkpoints that are distinct from the agents producing the work.

Coordination overhead at scale. With 19 squads and 240+ agent definitions, the meta-work of coordinating the coordinators becomes real. Lead agents briefing worker agents, agents reading other agents’ state files, memory rotation — all of this takes time and tokens before any actual work happens. We’ve learned to budget for coordination overhead when planning capacity. Context optimization is part of the answer, but the organizational complexity itself has a cost that frameworks don’t acknowledge.

What “Orchestration” Actually Means at This Scale

The word “orchestration” implies a conductor — something at the center directing everyone else. Our system doesn’t work that way. A better metaphor is a city: lots of independent actors, coordination through shared infrastructure (roads, signals, conventions), no one entity controlling the whole.

The shared infrastructure in our case is GitHub (issues, PRs, labels), git (memory files, version history), and a small set of conventions about how squads interface. The conventions are what enable coordination without centralization: work requests cross squad boundaries as GitHub issues, outputs ship as pull requests, memory lives in version-controlled markdown, and priority is signaled with labels.

These conventions are enforced by the squad architecture and baked into agent definitions. New squads inherit the patterns. New agents follow the patterns their squad uses.

The result is a system that scales horizontally — adding a new squad doesn’t require changing the orchestration layer, just adding definitions that follow the conventions. Whether this holds past 25 or 30 squads, we don’t know yet. We’ll find out.

The Numbers, Honestly

Running 19 squads for roughly $89/month in infrastructure sounds like a great deal. It is, for the API costs. But the total cost includes something that doesn’t show up in the invoice: founder time spent on coordination, debugging, and quality review.

When we count that time, the picture is more nuanced. High-autonomy squads — content, code review, intelligence scanning — run with minimal supervision. Low-autonomy squads — strategic planning, novel problem-solving, any work requiring relationship judgment — need more human oversight than they save. We’ve been honest about this in our operational write-up.

The orchestration layer reduces the coordination overhead, but it doesn’t eliminate the judgment gap. Agents that coordinate well still fail when the task requires knowledge they don’t have, judgment they haven’t developed, or relationships they can’t have. The system is good at automating the structured 80%. The other 20% is still human work.

That 80/20 split is what makes the economics work. Not that agents replace humans everywhere, but that agents handle the high-volume, structured work that would otherwise consume all of a founder’s time — leaving room for the work that actually requires a human.

Getting Started Without 19 Squads

The orchestration mechanics described here didn’t exist when we started. We built them incrementally, starting with three squads (engineering, marketing, intelligence) and adding coordination primitives as the complexity demanded them.

If you’re building your first multi-agent system with squads-cli, the right starting point is the conventions, not the scale:

npm install -g squads-cli
squads init

Start with GitHub issues as your work queue. Write agent outputs to markdown files. Create PRs for anything that touches production. These three habits will serve you whether you have three squads or thirty.

The orchestration complexity we’ve built is the result of scaling a system that was simple by design. Simple conventions, followed consistently, compound into something that actually works.


These are the mechanics behind what we built. For the full operational story — economics, what actually works, and what breaks — see How We Run a Real Company with AI Agent Squads. Want to build your own AI workforce? Talk to us.
