When you go from one AI agent to three running in parallel, something breaks that no framework documentation mentions: they all share the same filesystem.

Your task orchestrator carefully assigns different tasks to different agents. Agent A works on the API. Agent B writes tests. Agent C updates documentation. Clean separation, right?

Wrong. Agent A modifies config.yaml as a side effect. Agent C also touches config.yaml because the docs reference configuration. Agent B’s test touches a shared fixture file. Nobody coordinated the file-level access because the orchestration layer only thinks about task-level assignment.

What actually happens

We discovered this running OpDek, an AI operations engine that dispatches work to specialized agents via Claude CLI. With max_concurrent set to 3, here’s what the system looked like:

  • Task claiming: atomic SQLite UPDATE prevents double-dispatch. Works perfectly.
  • Supervisor lock: prevents two scheduling cycles from racing. Works perfectly.
  • File isolation: nothing. Zero. Every agent runs claude -p in the same working directory.
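The atomic claim is worth seeing concretely. Here is a minimal sketch, assuming a hypothetical tasks table with id, status, and agent columns (OpDek's actual schema isn't shown here); the key idea is that the UPDATE's WHERE clause checks the status, so exactly one agent's claim succeeds:

```python
import sqlite3

def claim_task(conn: sqlite3.Connection, task_id: str, agent_id: str) -> bool:
    """Atomically claim a task. The UPDATE only matches rows that are
    still 'pending', so if two agents race, at most one UPDATE changes
    a row; the other sees rowcount == 0 and backs off."""
    cur = conn.execute(
        "UPDATE tasks SET status = 'claimed', agent = ? "
        "WHERE id = ? AND status = 'pending'",
        (agent_id, task_id),
    )
    conn.commit()
    # rowcount is 1 for the single winner, 0 for every loser
    return cur.rowcount == 1
```

The same pattern prevents double-dispatch without any explicit locking: the database serializes the two UPDATEs, and the losing agent simply moves on.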

The system worked by accident. Throughput was low enough that concurrent file edits rarely overlapped. But “rarely” is not “never,” and as we increased parallelism, the probability of collision approached 1.

The gap between task orchestration and file safety

This is a pattern we see across the major multi-agent frameworks:

  • CrewAI: agents share a workspace. No file locking.
  • AutoGen: agents communicate via messages but share filesystem access.
  • LangGraph: workflow orchestration, not filesystem isolation.

Task-level coordination (which agent works on what) is a solved problem. File-level coordination (which agents may write to which files at the same time) is not even acknowledged as a problem in most architectures.

The database layer is fine. SQLite WAL mode handles concurrent reads. Atomic claim mechanisms prevent double-dispatch. But the moment agents start editing source files, JSON configs, or any shared resource on the filesystem, you’re back to the 1970s concurrency problem with none of the safeguards.

The fix: git worktrees as agent sandboxes

We solved this by giving each agent task its own git worktree:

  1. Before execution: git worktree add --detach memory/worktrees/wt_{task_id} creates an isolated copy of the repo at HEAD
  2. During execution: the agent’s CLI runs with cwd=worktree_path. It can read and write freely without affecting other agents or the main repo
  3. After execution: check git status in the worktree. If files changed, commit on an agent/{task_id} branch and merge back to main
  4. On conflict: preserve the branch, report the conflict. A supervisor (human or AI) resolves it
  5. Cleanup: remove the worktree regardless of outcome
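The lifecycle above can be sketched with two functions driving plain git commands via subprocess. This is a simplified illustration, not OpDek's actual code; the function names are hypothetical, and error paths are reduced to the essentials (on a merge conflict, the agent branch is left in place for a supervisor to resolve):

```python
import subprocess
from pathlib import Path

def run_git(args: list[str], cwd: Path, check: bool = True) -> subprocess.CompletedProcess:
    return subprocess.run(["git", *args], cwd=cwd, check=check,
                          capture_output=True, text=True)

def create_worktree(repo: Path, task_id: str) -> Path:
    """Step 1: isolated detached checkout of the repo at HEAD."""
    wt = repo / "memory" / "worktrees" / f"wt_{task_id}"
    wt.parent.mkdir(parents=True, exist_ok=True)
    run_git(["worktree", "add", "--detach", str(wt)], cwd=repo)
    return wt  # step 2: the agent CLI runs with cwd=wt

def merge_back(repo: Path, wt: Path, task_id: str) -> None:
    """Steps 3-5: commit changes on an agent branch, merge, clean up."""
    try:
        status = run_git(["status", "--porcelain"], cwd=wt).stdout
        if status.strip():
            branch = f"agent/{task_id}"
            run_git(["checkout", "-b", branch], cwd=wt)
            run_git(["add", "-A"], cwd=wt)
            run_git(["commit", "-m", f"agent work for {task_id}"], cwd=wt)
            try:
                run_git(["merge", "--no-edit", branch], cwd=repo)
            except subprocess.CalledProcessError:
                # Step 4: abort the merge but preserve the branch so a
                # supervisor (human or AI) can resolve the conflict.
                run_git(["merge", "--abort"], cwd=repo, check=False)
    finally:
        # Step 5: remove the worktree regardless of outcome
        run_git(["worktree", "remove", "--force", str(wt)], cwd=repo, check=False)
```

Because all worktrees share one object store, creating one is cheap; the expensive part, the merge, happens once per task and only when the agent actually changed files.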

This is essentially the same pattern that CI systems use when running parallel test suites. Each runner gets its own checkout. The merge back is the serialization point.

Graceful degradation

If worktree creation fails (disk full, git issue), the agent falls back to the shared directory. No task is blocked by the isolation layer failing. This is critical: the safety mechanism should never be the thing that breaks your system.
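The fallback fits in one function: try to create the worktree, and on any failure hand back the shared directory. A minimal sketch, again with hypothetical names:

```python
import subprocess
from pathlib import Path

def workdir_for_task(repo: Path, task_id: str) -> Path:
    """Prefer an isolated worktree; fall back to the shared checkout
    if worktree creation fails for any reason (disk full, git error)."""
    wt = repo / "memory" / "worktrees" / f"wt_{task_id}"
    try:
        wt.parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(
            ["git", "worktree", "add", "--detach", str(wt)],
            cwd=repo, check=True, capture_output=True,
        )
        return wt
    except (OSError, subprocess.CalledProcessError):
        # Degrade to the shared directory rather than block the task.
        # The caller should log this so degraded runs are visible.
        return repo
```

The one refinement worth making in practice: surface the fallback loudly in your logs, because a silently degraded run has exactly the collision risk the worktrees were built to eliminate.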

What this means for throughput

With worktree isolation, you can safely increase max_concurrent because agents can no longer interfere with each other’s file changes. The only serialization point is the merge, which is fast and atomic. We went from max_concurrent: 2 (with crossed fingers) to max_concurrent: 3 (with actual safety), and there’s no architectural reason we can’t go higher.

Takeaways

If you’re building any system that runs multiple AI agents against the same codebase:

  1. Task-level orchestration is not file-level safety. Your dispatcher knowing which agent has which task does not prevent them from editing the same files.

  2. “It works because throughput is low” is not a design. It’s a coincidence that will eventually fail.

  3. Git worktrees are the right primitive. They’re lightweight (shared object store), well-tested, and provide real filesystem isolation with a clean merge path.

  4. Build the isolation layer to be optional. If it fails, fall back. If it succeeds, use it. Never let the safety system be a single point of failure.

This is the kind of problem you only discover when you actually run multi-agent systems in production. Demos with one agent don’t surface it. Benchmarks don’t surface it. Only sustained parallel execution against real files does.


This post is part of the build log for OpDek, an AI operations engine. Follow along at dxdev.com/blog.