Phase 76: API Robustness
Status: Complete | Block: 8 (Consequence Enforcement & Scenario Expansion) | Tests: 25 new (3 test files)
Summary
Phase 76 addresses Block 8 exit criteria #7 (API schemas current) and #8 (API concurrency bugs fixed). It fixes 6 critical/high concurrency bugs and adds graceful shutdown, SQLite WAL mode, filesystem scan caching, request body limits, and health probe endpoints.
What Was Built
76a: Concurrency Fixes
- Batch semaphore: `_execute_batch()` now acquires `self._semaphore` before each `run_in_executor` call, respecting the `max_concurrent` limit
- Analysis semaphore: Lazily-initialized `asyncio.Semaphore(2)` wraps `/analysis/compare` and `/analysis/sweep` endpoints
- Per-client WebSocket multicast: `_progress_queues` changed from `dict[str, Queue]` to `dict[str, list[Queue]]`. New `subscribe()`/`unsubscribe()` API replaces `get_progress_queue()`. All progress pushes iterate subscriber list. `QueueFull` on one subscriber doesn't block others.
- Tempfile to thread pool: `tempfile.mkdtemp()` in `/runs/from-config` now runs via `asyncio.to_thread()`
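The batch semaphore item above can be illustrated with a minimal sketch. This is not the project's `RunManager`: the `BatchRunner` class, its `_blocking_job` workload, and the `peak_active` counter are hypothetical names introduced only to show a semaphore being acquired before each `run_in_executor` call. Running the jobs through `asyncio.gather` is also an assumption about how a batch is scheduled.

```python
import asyncio
import threading
import time
from typing import Optional


class BatchRunner:
    """Hypothetical stand-in for the batch path of a run manager."""

    def __init__(self, max_concurrent: int = 2) -> None:
        self._max_concurrent = max_concurrent
        # Created lazily so the semaphore binds to the running event loop.
        self._semaphore: Optional[asyncio.Semaphore] = None
        self._lock = threading.Lock()
        self._active = 0
        self.peak_active = 0  # highest number of jobs seen running at once

    def _blocking_job(self, index: int) -> int:
        """Simulated blocking work that runs in a worker thread."""
        with self._lock:
            self._active += 1
            self.peak_active = max(self.peak_active, self._active)
        time.sleep(0.02)
        with self._lock:
            self._active -= 1
        return index

    async def _run_one(self, index: int) -> int:
        if self._semaphore is None:
            self._semaphore = asyncio.Semaphore(self._max_concurrent)
        loop = asyncio.get_running_loop()
        # Acquire before handing work to the executor, so at most
        # max_concurrent jobs occupy worker threads at any time.
        async with self._semaphore:
            return await loop.run_in_executor(None, self._blocking_job, index)

    async def execute_batch(self, count: int) -> list:
        """Run `count` jobs, returning their results in submission order."""
        return await asyncio.gather(*(self._run_one(i) for i in range(count)))
```

Without the `async with self._semaphore` line, all jobs in the batch would be submitted to the executor at once and the concurrency limit would only be whatever the thread pool happens to allow.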
76b: Graceful Shutdown & Reliability
- Graceful shutdown: `RunManager.shutdown()` method sets all cancel flags, waits with timeout, cancels remaining tasks. Called from ASGI lifespan cleanup.
- Database hardening: WAL mode + `busy_timeout=5000` PRAGMAs in `initialize()`. Migration errors logged (not silently swallowed). `assert` replaced with `RuntimeError` in `.conn` property.
- Scan caching: `_ScanCache` class in `api/scenarios.py` with mtime-based invalidation. `scan_scenarios()` and `scan_units()` now cache results until directory mtime changes. `invalidate_cache()` for tests.
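The scan caching item can be sketched as follows. This is an illustrative approximation, not the contents of `api/scenarios.py`: the `ScanCache` class shape, the `scan_count` counter, and the `list_yaml_names` scanner (including the assumption that scanned files are `*.yaml`) are hypothetical. The sketch only shows the idea of re-scanning when the directory's mtime changes.

```python
import os
from pathlib import Path
from typing import Callable, List, Optional


class ScanCache:
    """Caches one directory scan until the directory's mtime changes."""

    def __init__(self, directory: Path, scan: Callable[[Path], List[str]]) -> None:
        self._directory = directory
        self._scan = scan
        self._mtime_ns: Optional[int] = None
        self._result: Optional[List[str]] = None
        self.scan_count = 0  # number of real filesystem scans performed

    def get(self) -> List[str]:
        # One stat call on the directory decides whether the cache is stale.
        mtime_ns = os.stat(self._directory).st_mtime_ns
        if self._result is None or mtime_ns != self._mtime_ns:
            self._result = self._scan(self._directory)
            self._mtime_ns = mtime_ns
            self.scan_count += 1
        return self._result

    def invalidate(self) -> None:
        """Drop the cached result, e.g. between tests."""
        self._mtime_ns = None
        self._result = None


def list_yaml_names(directory: Path) -> List[str]:
    """Hypothetical scanner: sorted names of YAML files in the directory."""
    return sorted(p.name for p in directory.glob("*.yaml"))
```

A cache built as `ScanCache(Path("scenarios"), list_yaml_names)` would scan once and then answer from memory until a file is added, removed, or renamed. Note that editing a file's contents in place, or changing something inside a nested subdirectory, does not update the parent directory's mtime, so this approach suits listings rather than content-sensitive caches.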
76c: Request Safety
- Schema validation: `Field(ge=1, le=1_000_000)` on `max_ticks` across all request schemas. `Field(ge=1, le=1_000)` on `num_iterations` (batch), `Field(ge=1, le=500)` on `num_iterations` (compare/sweep). `Field(max_length=50)` on sweep `values`. `_check_dict_depth()` validator on `config_overrides` and inline `config` (max depth 5, max 200 keys per level). `ConfigDict(str_max_length=100_000)` on all request schemas.
- Health endpoints: `/health/live` (instant 200, no external checks) and `/health/ready` (DB connectivity + cached scenario/unit counts). Existing `/health` preserved (now fast due to scan caching).
- New response models: `HealthLiveResponse`, `HealthReadyResponse`
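The depth and key-count limits on `config_overrides` can be illustrated with a plain-Python sketch. The real `_check_dict_depth()` is wired in as a schema validator and its exact rules are not shown here, so the function below is an assumption throughout: the name `check_dict_depth`, counting the top-level dict as depth 1, and treating lists as transparent containers are all choices made for the example.

```python
from typing import Any

MAX_DEPTH = 5    # maximum nesting of dicts, top level counted as 1
MAX_KEYS = 200   # maximum number of keys in any single dict


def check_dict_depth(
    value: Any,
    max_depth: int = MAX_DEPTH,
    max_keys: int = MAX_KEYS,
    _depth: int = 1,
) -> Any:
    """Raise ValueError if nested dicts exceed the depth or key limits."""
    if isinstance(value, dict):
        if _depth > max_depth:
            raise ValueError(f"dict nesting exceeds max depth of {max_depth}")
        if len(value) > max_keys:
            raise ValueError(f"dict has more than {max_keys} keys at depth {_depth}")
        for child in value.values():
            check_dict_depth(child, max_depth, max_keys, _depth + 1)
    elif isinstance(value, list):
        # Assumption: lists do not count as a nesting level themselves,
        # but dicts inside them are still checked.
        for item in value:
            check_dict_depth(item, max_depth, max_keys, _depth)
    return value
```

Returning the value unchanged makes the function usable as a pass-through validator: a caller can wrap it in whatever validation hook the schema library provides and let the `ValueError` surface as a request validation error.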
Files Modified
| File | Changes |
|---|---|
| `api/schemas.py` | `ConfigDict`, `Field` constraints, `_check_dict_depth()` validator, `HealthLiveResponse`/`HealthReadyResponse` models |
| `api/database.py` | WAL mode, `busy_timeout=5000`, migration error logging, `assert` replaced with `RuntimeError` |
| `api/scenarios.py` | `_ScanCache` class, `invalidate_cache()`, cached `scan_scenarios()`/`scan_units()` |
| `api/run_manager.py` | Multicast queues (`subscribe()`/`unsubscribe()`), batch semaphore enforcement, `shutdown()` method |
| `api/routers/runs.py` | `subscribe()`/`unsubscribe()` in WebSocket handlers, `asyncio.to_thread()` for tempfile |
| `api/routers/analysis.py` | Analysis concurrency semaphore |
| `api/routers/meta.py` | `/health/live` + `/health/ready` endpoints |
| `api/main.py` | Graceful shutdown in lifespan cleanup |
New Test Files
| File | Tests |
|---|---|
| `tests/api/test_concurrency.py` | 11 |
| `tests/api/test_reliability.py` | 8 |
| `tests/api/test_request_safety.py` | 6 |
| Total | 25 |
Design Decisions
- Multicast uses list of queues per run_id: Simpler than a pub/sub pattern, with no external dependencies and straightforward iteration over the subscriber list. Each subscriber gets an independent queue.
- Analysis semaphore is independent from RunManager semaphore: Analysis endpoints use a different thread pool and have different resource characteristics, so separate rate limiting is appropriate.
- WAL mode safe on :memory: databases: Requesting WAL on an in-memory database returns `"memory"` with no side effects, so the test suite (which uses in-memory SQLite) is unaffected.
- No signal handlers needed: uvicorn's ASGI lifespan handles SIGTERM/SIGINT cross-platform. The `shutdown()` method hooks into the lifespan cleanup phase.
- Scan cache uses directory mtime, not individual file mtimes: A single `os.stat()` call per directory is fast and sufficient, because any file add/remove/rename changes the directory's mtime. That is good enough for invalidation without scanning individual files.
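The first decision in this list, one list of queues per run_id, can be sketched as below. The `ProgressHub` class, the `publish()` method name, and the `maxsize` default are hypothetical and stand in for the relevant part of `RunManager`. Only `subscribe()`, `unsubscribe()`, `_progress_queues`, and the `QueueFull` behavior come from the description above.

```python
from __future__ import annotations

import asyncio
from typing import Any


class ProgressHub:
    """Fan-out of progress messages to every subscriber of a run."""

    def __init__(self) -> None:
        self._progress_queues: dict[str, list[asyncio.Queue]] = {}

    def subscribe(self, run_id: str, maxsize: int = 100) -> asyncio.Queue:
        """Register a new subscriber and return its private queue."""
        queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
        self._progress_queues.setdefault(run_id, []).append(queue)
        return queue

    def unsubscribe(self, run_id: str, queue: asyncio.Queue) -> None:
        """Remove a subscriber, dropping the run entry once it is empty."""
        subscribers = self._progress_queues.get(run_id, [])
        if queue in subscribers:
            subscribers.remove(queue)
        if not subscribers:
            self._progress_queues.pop(run_id, None)

    def publish(self, run_id: str, message: Any) -> int:
        """Push to every subscriber; return how many accepted the message."""
        delivered = 0
        # Iterate over a copy so unsubscribes during a push are harmless.
        for queue in list(self._progress_queues.get(run_id, [])):
            try:
                queue.put_nowait(message)
                delivered += 1
            except asyncio.QueueFull:
                # A slow client loses this message; other clients are unaffected.
                continue
        return delivered
```

In a WebSocket handler, the pattern would typically be to call `subscribe()` when the client connects, read from the returned queue in a loop, and call `unsubscribe()` in a `finally` block so disconnected clients do not accumulate.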
Postmortem
Scope
On target. All 3 substeps (concurrency, reliability, request safety) delivered. 25 tests cover the key behavioral changes.
Integration
Fully wired. All concurrency fixes are in existing API paths — no new modules. WebSocket multicast is backward-compatible (single client still works). Health endpoints added to existing router. Scan caching is transparent to all callers.
Quality
- Concurrency tests validate semaphore limits and multicast isolation
- Reliability tests verify WAL mode, shutdown behavior, and cache invalidation
- Request safety tests exercise field constraints and depth validation
- No behavioral changes to simulation engine — purely API layer
Deficits
0 new deficits. API-layer changes only, no simulation engine impact.
Performance
- Scan caching eliminates redundant filesystem walks on `/health` and scenario listing endpoints
- WAL mode improves concurrent read performance under write load
- 25 new tests run in <1s. No regression suite performance impact.