A few days ago, our test suite stopped finishing. Not flaking — hanging. GitHub Actions runners would time out, leaving us with no signal about whether the build was healthy. For a system managing autonomous agents and dispatch pipelines, a broken CI is a broken feedback loop.

The Hangover Problem

In a system orchestrating multiple agents, the test suite mirrors that complexity. We run tests for the dispatch pipeline, the resolver, the agent supervisor, and the Jira sync engine. Many of these components make network calls or manage async state. In CI, with no running server and no actual network, those calls become traps.

The symptoms were clear in the logs: tests would start, then silence. No failure, no timeout message, just the runner counting down to forced shutdown. When that happens, you get nothing — no diagnostics, no idea which test hung, sometimes no ability to reproduce locally.

Diagnosis: The Three Culprits

Over a day of commits, three patterns emerged:

Fixture imports that block. The tagger service imports opdek_config, which isn't available in CI. The import doesn't fail cleanly — it hangs during test discovery. Solution: make the import optional, with a guard for CI environments.
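A minimal sketch of that guard, assuming the standard CI environment variable; everything beyond the opdek_config name (the validate_tags helper, the KNOWN_TAGS attribute) is illustrative:

```python
import os

# In CI the import is skipped outright, since it can hang rather than
# fail cleanly; elsewhere a failed import degrades to "no config
# validation" instead of breaking test discovery.
opdek_config = None
if not os.environ.get("CI"):
    try:
        import opdek_config
    except ImportError:
        pass


def validate_tags(tags):
    """Validate against the real config only when it actually loaded."""
    if opdek_config is None:
        return True  # CI or missing config: skip validation
    return all(t in opdek_config.KNOWN_TAGS for t in tags)
```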

Unguarded network calls. The supervisor tests call urlopen() without mocking. The run_cycle engine calls check_stalls(), which hits the network. In isolation, each one waits indefinitely. Solution: mock them globally in the fixture setup.
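A conftest-level sketch of that global mock, assuming the code under test calls urllib.request.urlopen; the fixture and helper names are illustrative:

```python
import urllib.request
from unittest import mock

import pytest


def no_network_patcher():
    """Return a patcher that swaps urlopen for a canned response."""
    fake = mock.MagicMock()
    fake.read.return_value = b"{}"
    return mock.patch.object(urllib.request, "urlopen",
                             lambda *args, **kwargs: fake)


@pytest.fixture(autouse=True)
def no_network():
    # autouse applies this to every test; check_stalls would be
    # stubbed the same way on its own module.
    with no_network_patcher():
        yield
```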

No timeout backstop. Even with mocks in place, a hanging test had no hard limit. A single test could consume the entire 30-minute runner timeout, blocking any feedback. Solution: add a 2-second global socket timeout and skip integration tests that require a running server.
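The timeout backstop really is one line. A conftest.py sketch using the stdlib's process-wide default:

```python
import socket

# Every socket created after this point (by tests or the code under
# test) inherits a 2-second timeout, so no blocking call can outlive it.
socket.setdefaulttimeout(2)
```

Note this only covers sockets created after the call, which is why it belongs at the top of conftest.py, before any test code imports run.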

The Fix: Layered Defenses

Rather than one big refactor, we applied defenses in layers:

  1. Global socket timeout (conftest.py): Set a 2-second timeout on all socket operations. This stops indefinite waits in a single line of configuration.

  2. Fixture guards (tagger.py): Wrap the opdek_config import in a try/except. If it fails, the test still runs — it just doesn’t validate against the actual config.

  3. Network mocks (supervisor_test.py, run_cycle_test.py): Mock urlopen() and check_stalls() at the fixture level, before any test runs. Every subtest inherits the mocks.

  4. Selective skipping (conftest.py): Skip tests that require a running server or are known to deadlock. This is pragmatic: a skipped test is better than a hung one, and every skip carries a reason, e.g. @pytest.mark.skip(reason="requires running API server").

  5. Syntax verification (conftest.py): Guard DB loading and agent YAML parsing in try/except blocks. If the database is locked or config malformed, the test fixture fails fast with an error message, not a hang.

These aren’t elegant, but they’re honest. They acknowledge that CI is a hostile environment — no running server, network off, fixtures loaded in parallel — and that hanging tests are worse than no tests.

Outcome: Monitoring, Not Just Fixes

Once tests reliably finish, we added the final piece: visibility. A new GitHub Actions monitor alerts whenever a test run fails or times out. This gives us immediate signal — no more wondering if the build is healthy.

The commit that wired this together also consolidated recent work: Jira integration, multi-tenant config, and autonomous dispatch. All of that now runs through a test suite that actually finishes.

The Pattern: Defensive Test Architecture

Here’s what works in systems with async, network, and distributed state:

  • Timeouts are not optional. They’re not just for long-running tests — they’re your backstop against silent failure. A 2-second global timeout catches hangs that would otherwise balloon to runner shutdown.
  • Mock early, mock broadly. Don't wait for a test to fail on the network call. Mock network operations at fixture setup, before any test body runs. This makes CI fast and deterministic.
  • Guard imports, not just code. In complex systems, slow imports or missing config can hang test discovery itself. Defensive imports with fallbacks make this explicit.
  • Skip with reason, not silence. A skipped test is better than a hung one, but only if future readers know why it was skipped. Use @pytest.mark.skip() with a clear reason: “requires running API server”, not just “disable this for now”.
  • Monitor the monitor. Adding tests is only half the job. Wire up alerts for when the test run itself fails. That feedback loop is what catches the next hang before it ships.

The goal isn’t perfection — it’s a test suite that finishes, so we can actually see whether the system works.


Part 3 of The Repo — patterns and references that survived contact with reality.