How Multiple AI Agents Coordinate Without Stepping on Each Other

The first AI agent I built did everything itself. Read the code, wrote the code, ran the tests, fixed the bugs. That works for small tasks. It falls apart when the work gets big enough that you need parallelism.

The question is not whether you need multiple agents. You do. The question is how they share a workspace without corrupting each other's work.

I have been running multi-agent systems for months now — Karl Code's orchestrator with its 12 specialized agent profiles, a review system with parallel analyzers, a TTS pipeline with worker processes, and this blog's daily drafting cron. Here is what I have learned about making agents cooperate without stepping on each other.

Rule One: One Writer at a Time

This is the most important rule and the one most people get wrong.

When multiple agents write to the same files simultaneously, you get race conditions. Agent A reads a file, Agent B reads the same file, Agent A writes its changes, Agent B writes its changes and overwrites Agent A's work. Classic. The kind of bug that has existed since shared state was invented.

The fix is simple: serialize writes. Only one agent writes to a given file at a time.

In Karl Code, this means the orchestrator delegates write tasks to one subagent at a time. The code agent implements a feature. The test agent writes tests. But they do not run simultaneously against the same files. The orchestrator sequences them: code first, then test, then verify.
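A minimal sketch of serialized writes, assuming the agents run as threads inside one Python process. The `locked_update` helper and its lock registry are hypothetical names for illustration, not Karl Code's actual implementation. Each file gets its own lock, so the read-modify-write cycle on a given file can never interleave with another agent's.

```python
import threading
from collections import defaultdict
from pathlib import Path

# One lock per file path (hypothetical registry, in-process only).
_file_locks = defaultdict(threading.Lock)
_registry_guard = threading.Lock()


def locked_update(path: Path, transform) -> None:
    """Read, transform, and write a file while holding that file's lock."""
    with _registry_guard:
        # Look up (or create) the lock for this file under a guard so two
        # threads never create separate locks for the same path.
        lock = _file_locks[str(path.resolve())]
    with lock:
        current = path.read_text() if path.exists() else ""
        path.write_text(transform(current))
```

Without the lock, two agents appending to the same module can both read the old contents and the second write silently discards the first. Note that this only protects agents in a single process; separate processes need a lock at the filesystem level.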

Parallel reading is safe. Parallel writing is not. If you remember nothing else from this post, remember that.

Rule Two: Parallel Reading Is Free

While writes must be serialized, reads should be parallelized as aggressively as possible.

A review system I run sends a draft to multiple analyzers at once. Each analyzer needs the full draft to do its job. There is no reason to make them wait for each other. Each one reads, evaluates, and writes its own analysis to its own file. No contention.

This is the read-heavy, write-light pattern. Most analysis work — code review, architecture evaluation, security audit, documentation generation — is reading and thinking. The actual output is small compared to the input. So you parallelize the reading and serialize the writing.

The practical implementation is straightforward. Each agent gets its own working directory or output file. They read from the shared source. They write to their own space. A coordinator collects the outputs when all agents are done.
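Here is roughly what that looks like, as a sketch rather than the real review system. The two analyzers (`word_count`, `long_sentences`) and the one-JSON-file-per-analyzer layout are assumptions made up for the example. Every analyzer reads the same draft in parallel, writes only to its own file, and the coordinator collects the results once all of them finish.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


# Hypothetical analyzers: each takes the full draft text and returns a dict.
def word_count(text: str) -> dict:
    return {"words": len(text.split())}


def long_sentences(text: str) -> dict:
    return {"long_sentences": sum(1 for s in text.split(".") if len(s.split()) > 25)}


ANALYZERS = {"word_count": word_count, "long_sentences": long_sentences}


def run_analyzer(name: str, draft_path: Path, out_dir: Path) -> Path:
    text = draft_path.read_text()        # shared input, read-only
    out_path = out_dir / f"{name}.json"  # private output, no contention
    out_path.write_text(json.dumps(ANALYZERS[name](text)))
    return out_path


def review(draft_path: Path, out_dir: Path) -> dict:
    """Run all analyzers in parallel, then collect their output files."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=len(ANALYZERS)) as pool:
        paths = list(pool.map(lambda n: run_analyzer(n, draft_path, out_dir), ANALYZERS))
    return {p.stem: json.loads(p.read_text()) for p in paths}
```

Because no two analyzers share an output path, nothing here needs a lock.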

Rule Three: Bounded Workers

The TTS pipeline has a worker pool. It spawns processes to generate audio segments. In theory, you could spawn 100 workers and process 100 segments at once. In practice, you have a GPU with 12.3 GB of VRAM, and each worker needs some of that memory.

Bounding workers is not optional. Without limits, agents will starve each other for resources.

In the narrator-tts pipeline, the worker pool spawns a configurable number of qbench-worker processes — usually two or three, depending on the GPU. Each worker gets a bounded allocation. The orchestrator assigns work in chunks, monitors heartbeats, and knows when a worker is stuck.

In Karl Code, the hourly dispatcher reads a task queue and spawns workers — but with a hard cap of two concurrent workers. If three tasks are queued, the third waits. This is not a limitation. It is a design decision. Two agents working well beats four agents fighting for resources.

The general principle: figure out your real bottleneck — GPU memory, CPU, API rate limits, context windows — and set your worker count based on that constraint. Not on the number of tasks waiting.
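As a sketch of that principle: derive the worker count from the constraint, then let the pool enforce it. The function names, the reserved-memory figure, and the per-worker memory figure below are assumptions for illustration; only the 12.3 GB total comes from my setup.

```python
from concurrent.futures import ThreadPoolExecutor


def worker_limit(total_gb: float, reserved_gb: float,
                 per_worker_gb: float, hard_cap: int) -> int:
    """How many workers fit in memory, never exceeding the hard cap."""
    fit = int((total_gb - reserved_gb) // per_worker_gb)
    return max(0, min(fit, hard_cap))


def run_bounded(tasks, limit: int) -> list:
    """Run callables with at most `limit` in flight; the rest wait in the queue."""
    if limit < 1:
        raise ValueError("no capacity for even one worker")
    with ThreadPoolExecutor(max_workers=limit) as pool:
        return list(pool.map(lambda task: task(), tasks))


# Assumed figures: 1 GB reserved for the system, 4 GB per worker, cap of 3.
limit = worker_limit(total_gb=12.3, reserved_gb=1.0, per_worker_gb=4.0, hard_cap=3)
```

With those assumed numbers only two workers fit, so the cap of three never comes into play. The queue length does not appear anywhere in the calculation, which is the point.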

Rule Four: Files as the Coordination Layer

How do agents know what other agents have done? How does a worker know which segment to process next? How does the orchestrator know when a task is complete?

Files. The answer is always files.

The task queue is a JSON file. A worker reads the queue, picks the next task, marks it as in-progress, and writes the updated queue back. The manifest is a JSON file. Each worker writes its segment completion status. The orchestrator reads the manifest to know what is done.
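The read, mark, write-back cycle on the queue is itself a shared write, so it needs the same care as any other. Below is a minimal sketch, assuming a POSIX system (the `fcntl` module is not available on Windows) and a hypothetical queue schema: a JSON list of objects with `id` and `status` fields. The function name `claim_next_task` and the sidecar `.lock` file are also assumptions, not the actual dispatcher code.

```python
import fcntl  # POSIX only
import json
import os
import tempfile
from pathlib import Path


def claim_next_task(queue_path: Path, worker_id: str):
    """Mark the first pending task as in-progress and return it, or None."""
    lock_path = queue_path.with_suffix(".lock")
    with open(lock_path, "w") as lock_file:
        # Block until no other process holds the lock; closing the file
        # at the end of the `with` block releases it.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        tasks = json.loads(queue_path.read_text())
        task = next((t for t in tasks if t["status"] == "pending"), None)
        if task is None:
            return None
        task["status"] = "in-progress"
        task["worker"] = worker_id
        # Write to a temp file in the same directory, then swap it in
        # atomically so readers never see a half-written queue.
        fd, tmp_name = tempfile.mkstemp(dir=queue_path.parent, suffix=".tmp")
        with os.fdopen(fd, "w") as tmp:
            json.dump(tasks, tmp, indent=2)
        os.replace(tmp_name, queue_path)
        return task
```

The lock prevents two workers from claiming the same task, and the atomic replace means a crash mid-write leaves the old queue intact rather than a truncated JSON file.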

If you are building multi-agent coordination, your first instinct should be files. Your second instinct should also be files. In the cloud, those files become blob storage — S3, Azure Blob, GCS — but the principle is the same. Databases, message queues, and IPC channels come later — much later — and only when files have proven insufficient.

Files have three properties that make them ideal for agent coordination:

They are readable by everything. Any agent, any script, any human can read a JSON file. No client library needed. No connection string. No authentication.

They are diffable. When something goes wrong — and it will — you can diff the file to see exactly what changed. Try doing that with the internal state of a running process.

They are durable. If the orchestrator crashes, the queue file is still there. The workers can be restarted. The state survives the process. This is the same principle I wrote about with cron jobs: stateless agents, persistent files.

Rule Five: Clear Role Boundaries

Karl Code has 12 built-in agent profiles. Each one has a defined role, a defined set of tools, and a defined safety level. Here are six of them:

Profile       What It Does                           What It Cannot Do
------------  -------------------------------------  ----------------------------
orchestrator  Decomposes tasks, delegates, verifies  Does not write code directly
code          Implements features                    Does not modify tests
test          Writes and runs tests                  Does not modify source
debug         Investigates failures                  Read-only on source
verify        Checks correctness                     No modifications at all
explore       Reads codebase                         No writes, no tools

The boundaries are enforced at the tool level. The explore agent literally does not have write tools available. The verify agent cannot edit files. This is not a suggestion in the system prompt — it is a hard restriction in the tool configuration.
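To make the difference between a prompt suggestion and a tool-level restriction concrete, here is a small sketch. The tool names (`read_file`, `write_file`), the allowlists, and the stub implementations are all hypothetical; Karl Code's real configuration format is not shown here. The idea is that each profile's toolbox is built by filtering, so a forbidden tool is simply absent.

```python
from typing import Callable, Dict


# Stub tools for illustration; real ones would touch the filesystem.
def read_file(path: str) -> str:
    return f"<contents of {path}>"


def write_file(path: str, text: str) -> str:
    return f"wrote {len(text)} bytes to {path}"


ALL_TOOLS: Dict[str, Callable] = {"read_file": read_file, "write_file": write_file}

# Allowlist per profile (assumed tool sets, not the real configuration).
PROFILE_TOOLS = {
    "explore": {"read_file"},
    "verify": {"read_file"},
    "code": {"read_file", "write_file"},
}


def tools_for(profile: str) -> Dict[str, Callable]:
    """Return only the tools this profile may call; the rest do not exist for it."""
    allowed = PROFILE_TOOLS[profile]
    return {name: fn for name, fn in ALL_TOOLS.items() if name in allowed}


def call_tool(profile: str, tool: str, *args):
    toolbox = tools_for(profile)
    if tool not in toolbox:
        raise PermissionError(f"profile {profile!r} has no tool {tool!r}")
    return toolbox[tool](*args)
```

No amount of creative prompting gets the explore profile a write tool, because the dispatcher never hands it one.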

Role boundaries prevent the most common multi-agent failure: two agents unknowingly working on the same thing. When the code agent is the only one that can modify source files, you will never have the test agent accidentally changing implementation code while trying to fix a test.

Rule Six: Communication Through Contracts

Agents should not talk to each other freely. They should communicate through well-defined contracts.

In the TTS pipeline, the orchestrator sends a WorkAssignment message to a worker. The worker responds with WorkComplete or WorkFailed. That is the entire communication protocol. The worker does not ask the orchestrator questions. The orchestrator does not send mid-task corrections.

This is the JSON IPC protocol that runs over stdin/stdout between the orchestrator and worker processes. Every message is a typed struct. Every response is expected. No freeform conversation.
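A sketch of what such a contract can look like in Python, with one JSON object per line. The three message names come from the pipeline; the fields (`task_id`, `text`, `output_path`, `error`) and the `encode`/`decode` helpers are assumptions for illustration, not the pipeline's actual schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class WorkAssignment:
    task_id: str
    text: str


@dataclass
class WorkComplete:
    task_id: str
    output_path: str


@dataclass
class WorkFailed:
    task_id: str
    error: str


MESSAGE_TYPES = {cls.__name__: cls for cls in (WorkAssignment, WorkComplete, WorkFailed)}


def encode(message) -> str:
    """Serialize a message to one JSON line, tagged with its type name."""
    return json.dumps({"type": type(message).__name__, **asdict(message)})


def decode(line: str):
    """Parse one JSON line back into a typed message; reject anything else."""
    payload = json.loads(line)
    cls = MESSAGE_TYPES.get(payload.pop("type", None))
    if cls is None:
        raise ValueError("unknown message type")
    return cls(**payload)  # raises TypeError on missing or unexpected fields
```

Anything that is not one of the three known shapes fails at `decode`, which keeps freeform chatter out of the channel by construction.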

In Karl Code, the orchestrator delegates to a subagent with a task description. The subagent does the work and returns a result. The subagent does not call back mid-task to ask follow-up questions. If it needs information, it reads the files itself.

Constrained communication is reliable communication. The more channels you open between agents, the more failure modes you create.

When Coordination Breaks

I have hit every failure mode in this list at least once.

Two agents edited the same file. Before I enforced serialized writes in the orchestrator, the code agent and the test agent both modified a Python module. The test agent's write won, silently deleting the implementation. Tests passed because there was nothing to test against. I caught it in code review, not at runtime.

A worker hung forever. A TTS worker process hit an edge case in the audio library and sat there indefinitely, holding GPU memory. The orchestrator kept waiting for a heartbeat response that never came, so the whole pipeline stalled. Fix: a heartbeat timeout. If a worker does not respond within 60 seconds, the orchestrator kills it and reassigns the work.

The queue file got corrupted. Two dispatchers were running (I had accidentally left a cron job enabled after testing). Both tried to update the task queue at the same time, and the JSON came out malformed. Both dispatchers crashed. Fix: file locking. One writer at a time, enforced at the filesystem level.
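The heartbeat timeout from the hung-worker failure can be sketched as a small bookkeeping class. `HeartbeatMonitor` and its method names are hypothetical, and the injectable clock is there only so the logic can be tested without waiting a real minute; the 60-second default matches the timeout I use.

```python
import time


class HeartbeatMonitor:
    """Track the last heartbeat per worker and report the ones gone quiet."""

    def __init__(self, timeout_s: float = 60.0, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock  # monotonic, so wall-clock adjustments cannot confuse it
        self.last_seen = {}

    def beat(self, worker_id: str) -> None:
        """Record that a worker just reported in."""
        self.last_seen[worker_id] = self.clock()

    def stale_workers(self) -> list:
        """Return workers whose last heartbeat is older than the timeout."""
        now = self.clock()
        return [w for w, seen in self.last_seen.items() if now - seen > self.timeout_s]

    def forget(self, worker_id: str) -> None:
        """Stop tracking a worker, for example after killing it."""
        self.last_seen.pop(worker_id, None)
```

The orchestrator would call `stale_workers()` on each loop iteration, kill each process it returns, put that worker's task back in the queue, and then call `forget()`.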

Each failure taught me a rule above. I did not design these rules upfront. I extracted them from failures.

The Meta-Pattern

If you zoom out, all six rules follow from one principle:

Multi-agent coordination is a solved problem if you treat agents like people working on a shared project. Give each person a clear role. Let them read freely. Serialize the writes. Keep the shared state in a format everyone can inspect. And do not let them talk over each other.

We know how to coordinate human teams. We have been doing it for centuries. The same patterns apply to AI agents. The difference is that agents are faster, more numerous, and less likely to ask for clarification when something is ambiguous — which makes the constraints even more important.

Start with files — or blob storage, or whatever durable store fits your deployment. Serialize your writes. Bound your workers. The rest follows.