The Zombie That Blocked Everything
There’s a class of failure mode in automated systems that I’ve started calling the zombie problem.
A zombie isn’t an error. It doesn’t throw an exception. It doesn’t fail loudly. It just… persists. Status: in_review. Technically alive. Functionally dead. And quietly rotting in the queue while everything downstream piles up behind it.
We ran into one. It took us a while to realize the zombie was the whole problem.
When we moved to a multi-agent architecture, we introduced a proposal system. The idea was simple and good: any agent that noticed a systemic improvement could submit a formal proposal — a structured recommendation with a problem, solution, impact, and cost estimate. A designated reviewer (in our case, the CTO role) would evaluate it and either promote, reject, or defer it.
Six agents could submit proposals. One agent reviewed them. We didn’t think much about that ratio at the time.
One proposal — P-1 — was flagged for CTO review. It concerned the proposal system itself. P-1 went into review. And stayed there. Not rejected. Not deferred. Not approved. Just: in_review. A zombie had entered the pipeline.
One morning the watchdog fired 17 alerts in a single sweep. Seventeen alerts. One root cause. The zombie wasn’t creating 17 problems — it was creating 1 problem that manifested in 17 places. Then someone submitted a proposal to fix the review bottleneck. It went into the CTO review queue. Behind P-1. We had a meta-deadlock: the proposal to fix the proposal system was stuck in the same broken queue it was trying to fix.
The fix took less than an afternoon. The learning — that governance structures need their own escape hatches — is permanent.
Goal
Fix the proposal review bottleneck that was generating 17 cascading alerts and preventing all system improvements from landing. Motivation: the watchdog was drowning real signals in noise from a single stuck proposal.
Where We Are
```mermaid
graph LR
    A[6 Agents] -->|submit| B[Proposal Queue]
    B -->|route| C[CTO Review]
    C -->|approve/reject| D[Execution]
    C -.->|stuck here| E[Zombie P-1]
    E -.->|blocks| B
    F[Watchdog] -->|monitors| B
    F -->|monitors| C
    F -->|17 alerts| G[Alert Storm]
    style B fill:#ff6b6b,stroke:#333
    style C fill:#ff6b6b,stroke:#333
    style E fill:#666,stroke:#333,color:#fff
```
This post focuses on the proposal queue and review pipeline (red). The zombie (grey) sat in CTO review and blocked everything downstream.
Problems Encountered
- No review timeout — proposals could sit `in_review` indefinitely with no escalation
- Single reviewer bottleneck — 6:1 agent-to-reviewer ratio with no overflow path
- Meta-deadlock — a proposal to fix proposals queued behind the broken proposal
- Alert cascade — 17 alerts from 1 root cause, masking the real problem
- No bypass for structural failures — the governance layer had no fire exit
Resolution
Added three mechanisms:
1. Meta-proposal detection — proposals about the proposal system itself are now auto-rejected at the creation gate with guidance to fix the system directly. Uses keyword matching against title + problem text.
```python
_META_KEYWORDS = re.compile(
    r"\b(proposal queue|proposal system|approval queue|approval workflow|"
    r"approval gate|meta-proposal|too many proposals)\b",
    re.IGNORECASE,
)
```

2. CEO-level override — when the CEO agent reviews a proposal, it auto-converts to `status='approved'` and exits the queue immediately. No more sitting in pipeline limbo.
```python
# CEO approved: mark as approved so it exits the review queue
if next_level == "ceo_reviewed":
    db.approve_proposal(proposal_id)
```

3. Auto-approve gate for safe proposals — low-risk tactical proposals skip the full review pipeline entirely. Criteria: type is tactical, confidence >= 60%, low cost, no protected files, no external actions (deploy, publish, etc.).
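The gate is easy to sketch. Here is a minimal illustration of those criteria, assuming a dict-shaped proposal; the field names and the `PROTECTED_FILES` / `EXTERNAL_ACTIONS` sets are hypothetical, not the actual implementation:

```python
# Illustrative sketch of the auto-approve gate. Field names, file paths,
# and action names are assumptions; thresholds come from the post.
PROTECTED_FILES = {"config/agents.yaml", "memory/crons.json"}
EXTERNAL_ACTIONS = {"deploy", "publish"}

def is_auto_approvable(proposal: dict) -> bool:
    """Return True if a proposal is safe enough to skip full review."""
    if proposal.get("type") != "tactical":
        return False                      # only tactical proposals qualify
    if proposal.get("confidence", 0) < 0.60:
        return False                      # confidence must be >= 60%
    if proposal.get("cost", "high") != "low":
        return False                      # low cost only
    if set(proposal.get("files", [])) & PROTECTED_FILES:
        return False                      # never touch protected files
    if set(proposal.get("actions", [])) & EXTERNAL_ACTIONS:
        return False                      # no external side effects
    return True
```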
4. Alert deduplication — watchdog now deduplicates alerts by (task_id + title) key so cascading failures from a single root cause don’t generate 17 separate alerts. TTL-based expiry (2h default) auto-resolves stale alerts.
To test zombie detection locally:
```shell
python3 tools/monitor/watchdog.py --dry-run --verbose
```

Dependencies
- Python 3.10+
- SQLite with the `proposals` and `proposal_reviews` tables
- Watchdog cron (runs every 5 min via `memory/crons.json`)
- CEO strategic loop (`tools/dispatch/ceo_strategic_loop.py`) — runs every 2h
- Agent role config in `config/agents.yaml` for hierarchical review routing
Deep Dive
The Proposal Pipeline
Proposals flow through a hierarchical review system introduced in commit 72a8143:
```
submitted → dept_reviewed → ceo_reviewed → board_presented
```

The creation gate runs three checks before a proposal enters the queue:
- Dedup — 65% similarity check against existing proposals
- Throttle — max 5 proposals per agent per hour
- Meta-detection — rejects proposals about the proposal system itself
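A compact sketch of that gate, with the 65% similarity check done via `difflib` (the real implementation may use a different similarity measure; the `gate` function and its arguments are illustrative):

```python
import difflib
import re

# Keyword pattern from the meta-detection check described above.
_META_KEYWORDS = re.compile(
    r"\b(proposal queue|proposal system|approval queue|approval workflow|"
    r"approval gate|meta-proposal|too many proposals)\b",
    re.IGNORECASE,
)

def gate(proposal: dict, existing_titles: list[str], recent_count: int) -> str:
    """Run the three creation-gate checks; return a verdict string."""
    # 1. Dedup: reject if >= 65% similar to an existing proposal title
    for title in existing_titles:
        ratio = difflib.SequenceMatcher(
            None, proposal["title"].lower(), title.lower()
        ).ratio()
        if ratio >= 0.65:
            return "rejected: duplicate"
    # 2. Throttle: max 5 proposals per agent per hour
    if recent_count >= 5:
        return "rejected: throttled"
    # 3. Meta-detection: proposals about the proposal system itself
    if _META_KEYWORDS.search(proposal["title"] + " " + proposal["problem"]):
        return "rejected: meta-proposal"
    return "accepted"
```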
The schema stores review state at two levels:
```sql
-- Proposal status tracks lifecycle
status TEXT DEFAULT 'pending'  -- draft | pending | promoted | approved | rejected

-- Review level tracks pipeline position
review_level TEXT DEFAULT 'submitted'  -- submitted | dept_reviewed | ceo_reviewed | board_presented
```

Health Check Architecture
The watchdog (`tools/monitor/watchdog.py`) runs 6 categories of health checks:
- Zombie sessions — `status='running'` but `completed_at` is set → auto-fix to complete
- Ghost sessions — no heartbeat >30 min → mark abandoned (skips interactive sessions)
- Stuck tasks — running >1h without heartbeat → alert at 1h, auto-fail at 6h
- Stale dispatches — stuck in `sent` >5 min → auto-requeue (max 3 retries)
- Cron health — critical crons stale >3 min → auto-restart cron runner
- Post-completion mutations — workstreams added after session end → corruption warning
The dispatch health layer (documented in ADR-0006) splits checks into hard blocks and soft warnings. Hard blocks always abort: capacity exceeded, backlog >3, success rate <70%. Soft warnings log and proceed. P0/P1 dispatches bypass all soft warnings.
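The hard/soft split can be sketched as a function returning a go/no-go plus any warnings. Thresholds come from the post; the function shape is illustrative, and the queue-age soft warning is a made-up placeholder since ADR-0006's soft checks aren't enumerated here:

```python
def check_dispatch(metrics: dict, priority: str) -> tuple[bool, list[str]]:
    """Return (ok, notes). Hard blocks abort; soft warnings log and proceed."""
    # Hard blocks always abort, regardless of priority.
    if metrics["capacity_used"] > metrics["capacity_limit"]:
        return False, ["hard block: capacity exceeded"]
    if metrics["backlog"] > 3:
        return False, ["hard block: backlog > 3"]
    if metrics["success_rate"] < 0.70:
        return False, ["hard block: success rate < 70%"]
    warnings: list[str] = []
    # Soft warnings log and proceed; P0/P1 dispatches bypass them entirely.
    # (Queue age is a placeholder example of a soft condition.)
    if priority not in ("P0", "P1") and metrics.get("queue_age_min", 0) > 30:
        warnings.append("soft warning: queue age > 30 min")
    return True, warnings
```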
The Alert Cascade Explained
The 17 alerts traced back to one zombie:
- Proposal queue depth → over threshold (zombie blocking queue)
- Proposal throughput → near zero (nothing clearing review)
- CTO review latency → extreme (zombie sitting in review)
- Agent proposal backlog → 6 agents, 0 resolutions
- Dispatch health degraded → downstream of proposal backlog
- System improvement rate → stalled
- Alerts 7–17 — cascade through dependent health checks
All shared the same or linked `task_id` in `memory/alerts.json`, which is how we traced them to one root cause. The dedup fix ensures this cascade now surfaces as 1 grouped alert.
Key Commits
| Commit | Date | What |
|---|---|---|
| 72a8143 | Feb 20 | Hierarchical proposal routing introduced |
| ebf7af1 | Mar 22 | Watchdog: auto-fix zombie/ghost sessions |
| 42bc65e | Mar 22 | Fix duplicate review dispatches |
| 4540c6f | Apr 8 | Auto-approve gate for safe tactical proposals |
| a6aac9d | Apr 9 | CEO-approved proposals auto-resolve |
The Pattern
If you’re building a system where agents can propose changes through a review queue, ask yourself before you ship:
What happens when the reviewer becomes the bottleneck?
Single-reviewer queues are a bus factor problem. In human systems, we manage this with vacations, backup approvers, and escalation paths. In automated systems, it’s easy to skip this because the reviewer is “always available.” But software can be busy, defer, or stall. When it does, the queue behind it is a graveyard of good ideas — all technically in_review, all zombies.
Every gate needs a fire exit. When you introduce a governance layer into an autonomous system, you’re adding overhead in exchange for control. The trouble starts when the control mechanism itself becomes uncontrollable. The system wasn’t broken. It was doing exactly what it was designed to do. The design was the problem.
Visual Summary
Infographic coming soon.
Part 5 of The Timeline — the true story of building an AI operations engine, backed by git history and real incidents.
Previous: The Alert Storm That Wasn’t | Next: TBD