The Zombie That Blocked Everything

There’s a class of failure mode in automated systems that I’ve started calling the zombie problem.

A zombie isn’t an error. It doesn’t throw an exception. It doesn’t fail loudly. It just… persists. Status: in_review. Technically alive. Functionally dead. And quietly rotting in the queue while everything downstream piles up behind it.

We ran into one. It took us a while to realize the zombie was the whole problem.

When we moved to a multi-agent architecture, we introduced a proposal system. The idea was simple and good: any agent that noticed a systemic improvement could submit a formal proposal — a structured recommendation with a problem, solution, impact, and cost estimate. A designated reviewer (in our case, the CTO role) would evaluate it and either promote, reject, or defer it.

Six agents could submit proposals. One agent reviewed them. We didn’t think much about that ratio at the time.

One proposal — P-1 — was flagged for CTO review. It concerned the proposal system itself. P-1 went into review. And stayed there. Not rejected. Not deferred. Not approved. Just: in_review. A zombie had entered the pipeline.

One morning the watchdog fired 17 alerts in a single sweep. Seventeen alerts. One root cause. The zombie wasn’t creating 17 problems — it was creating 1 problem that manifested in 17 places. Then someone submitted a proposal to fix the review bottleneck. It went into the CTO review queue. Behind P-1. We had a meta-deadlock: the proposal to fix the proposal system was stuck in the same broken queue it was trying to fix.

The fix took less than an afternoon. The learning — that governance structures need their own escape hatches — is permanent.


Goal

Fix the proposal review bottleneck that was generating 17 cascading alerts and preventing all system improvements from landing. Motivation: the watchdog was drowning real signals in noise from a single stuck proposal.

Where We Are

graph LR
    A[6 Agents] -->|submit| B[Proposal Queue]
    B -->|route| C[CTO Review]
    C -->|approve/reject| D[Execution]
    C -.->|stuck here| E[Zombie P-1]
    E -.->|blocks| B
    F[Watchdog] -->|monitors| B
    F -->|monitors| C
    F -->|17 alerts| G[Alert Storm]

    style B fill:#ff6b6b,stroke:#333
    style C fill:#ff6b6b,stroke:#333
    style E fill:#666,stroke:#333,color:#fff

This post focuses on the proposal queue and review pipeline (red). The zombie (grey) sat in CTO review and blocked everything downstream.

Problems Encountered

  • No review timeout — proposals could sit in_review indefinitely with no escalation
  • Single reviewer bottleneck — 6:1 agent-to-reviewer ratio with no overflow path
  • Meta-deadlock — a proposal to fix proposals queued behind the broken proposal
  • Alert cascade — 17 alerts from 1 root cause, masking the real problem
  • No bypass for structural failures — the governance layer had no fire exit

Resolution

Added three mechanisms:

1. Meta-proposal detection — proposals about the proposal system itself are now auto-rejected at the creation gate with guidance to fix the system directly. Uses keyword matching against title + problem text.

import re

_META_KEYWORDS = re.compile(
    r"\b(proposal queue|proposal system|approval queue|approval workflow|"
    r"approval gate|meta-proposal|too many proposals)\b",
    re.IGNORECASE,
)
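The gate check itself is simple: concatenate the title and problem text and search for the pattern. Here is a minimal runnable sketch (the pattern is repeated so the snippet stands alone; the helper name `is_meta_proposal` is an assumption, not the real function name):

```python
import re

# Meta-proposal keyword pattern from the creation gate.
_META_KEYWORDS = re.compile(
    r"\b(proposal queue|proposal system|approval queue|approval workflow|"
    r"approval gate|meta-proposal|too many proposals)\b",
    re.IGNORECASE,
)

def is_meta_proposal(title: str, problem: str) -> bool:
    """True if the proposal is about the proposal system itself."""
    return bool(_META_KEYWORDS.search(f"{title} {problem}"))
```

A proposal titled "Fix the proposal queue" is rejected at creation; "Add retry logic" passes through.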

2. CEO-level override — when the CEO agent reviews a proposal, it auto-converts to status='approved' and exits the queue immediately. No more sitting in pipeline limbo.

# CEO approved: mark as approved so it exits the review queue
if next_level == "ceo_reviewed":
    db.approve_proposal(proposal_id)

3. Auto-approve gate for safe proposals — low-risk tactical proposals skip the full review pipeline entirely. Criteria: type is tactical, confidence >= 60%, low cost, no protected files, no external actions (deploy, publish, etc.).
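The gate boils down to a conjunction of the criteria above. A minimal sketch, assuming a dict-shaped proposal record (the field names, the cost encoding, and the exact action and file lists are assumptions; the thresholds match the post):

```python
# Assumed sets of risky actions and protected paths; the real lists live in config.
EXTERNAL_ACTIONS = {"deploy", "publish"}
PROTECTED_FILES = {"config/agents.yaml", "memory/crons.json"}

def can_auto_approve(proposal: dict) -> bool:
    """Low-risk tactical proposals skip the full review pipeline."""
    return (
        proposal.get("type") == "tactical"
        and proposal.get("confidence", 0.0) >= 0.60
        and proposal.get("estimated_cost", "high") == "low"
        and not (set(proposal.get("files", [])) & PROTECTED_FILES)
        and not (set(proposal.get("actions", [])) & EXTERNAL_ACTIONS)
    )
```

Anything that fails a single criterion falls back to the normal review pipeline; the gate only ever removes review steps for proposals that could not do damage.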

4. Alert deduplication — watchdog now deduplicates alerts by (task_id + title) key so cascading failures from a single root cause don’t generate 17 separate alerts. TTL-based expiry (2h default) auto-resolves stale alerts.

To test zombie detection locally:

python3 tools/monitor/watchdog.py --dry-run --verbose

Dependencies

  • Python 3.10+
  • SQLite with the proposals and proposal_reviews tables
  • Watchdog cron (runs every 5 min via memory/crons.json)
  • CEO strategic loop (tools/dispatch/ceo_strategic_loop.py) — runs every 2h
  • Agent role config in config/agents.yaml for hierarchical review routing

Deep Dive

The Proposal Pipeline

Proposals flow through a hierarchical review system introduced in commit 72a8143:

submitted → dept_reviewed → ceo_reviewed → board_presented

The creation gate runs three checks before a proposal enters the queue:

  1. Dedup — 65% similarity check against existing proposals
  2. Throttle — max 5 proposals per agent per hour
  3. Meta-detection — rejects proposals about the proposal system itself

The schema stores review state at two levels:

-- Proposal status tracks lifecycle
status TEXT DEFAULT 'pending' -- draft | pending | promoted | approved | rejected
-- Review level tracks pipeline position
review_level TEXT DEFAULT 'submitted' -- submitted | dept_reviewed | ceo_reviewed | board_presented
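Given that schema, zombie detection is a query: proposals still pending at a non-terminal review level with no recent activity. A sketch against SQLite; the `updated_at` column and the 24h cutoff are assumptions about the real schema and policy:

```python
import sqlite3

# Assumed: an updated_at timestamp column and a 24h review timeout.
ZOMBIE_QUERY = """
SELECT id, title FROM proposals
WHERE status = 'pending'
  AND review_level IN ('submitted', 'dept_reviewed')
  AND updated_at < datetime('now', '-24 hours')
"""

def find_zombies(conn: sqlite3.Connection) -> list[tuple]:
    """Proposals alive by status but dead by activity."""
    return conn.execute(ZOMBIE_QUERY).fetchall()
```

The point is that "zombie" is defined by the gap between the two columns: status says alive, timestamps say nothing has moved.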

Health Check Architecture

The watchdog (tools/monitor/watchdog.py) runs 6 categories of health checks:

  • Zombie sessions — status='running' but completed_at is set → auto-fix to complete
  • Ghost sessions — no heartbeat >30 min → mark abandoned (skips interactive sessions)
  • Stuck tasks — running >1h without heartbeat → alert at 1h, auto-fail at 6h
  • Stale dispatches — stuck in sent >5 min → auto-requeue (max 3 retries)
  • Cron health — critical crons stale >3 min → auto-restart cron runner
  • Post-completion mutations — workstreams added after session end → corruption warning

The dispatch health layer (documented in ADR-0006) splits checks into hard blocks and soft warnings. Hard blocks always abort: capacity exceeded, backlog >3, success rate <70%. Soft warnings log and proceed. P0/P1 dispatches bypass all soft warnings.
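The hard/soft split can be sketched as follows. The hard-block thresholds (backlog >3, success rate <70%) and the P0/P1 bypass come from ADR-0006 as described above; the capacity fields and the example soft warning are assumptions:

```python
def dispatch_health(metrics: dict, priority: str) -> tuple[bool, list[str]]:
    """Return (allowed, messages). Hard blocks abort; P0/P1 skip soft warnings."""
    hard, soft = [], []
    if metrics.get("capacity_used", 0) > metrics.get("capacity_max", 0):
        hard.append("capacity exceeded")
    if metrics.get("backlog", 0) > 3:
        hard.append("backlog > 3")
    if metrics.get("success_rate", 1.0) < 0.70:
        hard.append("success rate < 70%")
    # Example soft warning (assumed): over latency budget logs but proceeds.
    if metrics.get("latency_p95", 0) > metrics.get("latency_budget", float("inf")):
        soft.append("latency over budget")
    if hard:
        return False, hard          # hard blocks always abort
    if priority in ("P0", "P1"):
        return True, []             # urgent dispatches bypass soft warnings
    return True, soft
```

The asymmetry is deliberate: hard blocks protect the system from itself, while soft warnings are advisory and must never delay an urgent dispatch.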

The Alert Cascade Explained

The 17 alerts traced back to one zombie:

  1. Proposal queue depth → over threshold (zombie blocking queue)
  2. Proposal throughput → near zero (nothing clearing review)
  3. CTO review latency → extreme (zombie sitting in review)
  4. Agent proposal backlog → 6 agents, 0 resolutions
  5. Dispatch health degraded → downstream of proposal backlog
  6. System improvement rate → stalled
  7–17. Cascade through dependent health checks

All shared the same or linked task_id in memory/alerts.json, which is how we traced them to one root cause. The dedup fix ensures this cascade now surfaces as 1 grouped alert.

Key Commits

Commit    Date    What
72a8143   Feb 20  Hierarchical proposal routing introduced
ebf7af1   Mar 22  Watchdog: auto-fix zombie/ghost sessions
42bc65e   Mar 22  Fix duplicate review dispatches
4540c6f   Apr 8   Auto-approve gate for safe tactical proposals
a6aac9d   Apr 9   CEO-approved proposals auto-resolve

The Pattern

If you’re building a system where agents can propose changes through a review queue, ask yourself before you ship:

What happens when the reviewer becomes the bottleneck?

Single-reviewer queues are a bus factor problem. In human systems, we manage this with vacations, backup approvers, and escalation paths. In automated systems, it’s easy to skip this because the reviewer is “always available.” But software can be busy, defer, or stall. When it does, the queue behind it is a graveyard of good ideas — all technically in_review, all zombies.

Every gate needs a fire exit. When you introduce a governance layer into an autonomous system, you’re adding overhead in exchange for control. The problem is when the control mechanism itself becomes uncontrollable. The system wasn’t broken. It was doing exactly what it was designed to do. The design was the problem.

Visual Summary

Infographic coming soon.


Part 5 of The Timeline — the true story of building an AI operations engine, backed by git history and real incidents.

Previous: The Alert Storm That Wasn’t | Next: TBD