Phase 57: Full Validation & Regression¶
Summary¶
Phase 57 is the final phase of Block 6 and the capstone of 57 phases of development. It validates that all scenarios produce correct historical outcomes, verifies every calibration parameter is consumed and exercised, closes every deficit in the project with a formal disposition, fixes an OPERATIONAL resolution deadlock introduced by the Phase 55a closing range guard, migrates checkpoint serialization from pickle to JSON, and synchronizes all documentation.
Key outcomes: - All 37 scenarios complete without error - MC threshold tightened from 60% to 80% (N=10 seeds) - Every CalibrationSchema field has at least one Python consumer and at least one test/scenario exercising it - Zero unresolved deficits — all 169 items in devlog/index.md have dispositions (134 resolved, 34 accepted limitations, 1 dormant capability) - OPERATIONAL resolution deadlock discovered and fixed (forces frozen between 15-30km) - Checkpoint serialization migrated from pickle to JSON (NumpyEncoder, legacy pickle fallback) - 6 scenario recalibrations following engine fix - All documentation synchronized via cross-doc audit
What Was Built¶
57a: Full Scenario Evaluation¶
- Tightened MC thresholds:
MIN_CORRECT_FRACTION0.6 -> 0.8,N_SEEDS5 -> 10 inTestHistoricalAccuracyMC - Victory condition tests:
TestVictoryConditions— 13 decisive combat scenarios verified to not resolve viatime_expired - Module-scoped eval fixture: Single evaluator run shared across test classes — avoids redundant full-scenario evaluations
- Scenario coverage assertion:
test_all_scenarios_in_regression— verifies every scenario YAML is tracked inHISTORICAL_WINNERSorDRAW_SCENARIOS - YAML load validation:
test_all_scenarios_load_cleanly— parametrized test loading all scenario YAMLs - falklands_campaign added to
HISTORICAL_WINNERS(blue) - Victory conditions added to 4 scenarios:
73_easting,bekaa_valley_1982,gulf_war_ew_1991,falklands_naval— these previously lacked explicit victory_conditions in YAML
57b: Calibration Parameter Coverage¶
New test file tests/validation/test_calibration_coverage.py with 8 tests:
1. test_all_fields_have_consumers — every CalibrationSchema field referenced in Python source
2. test_morale_subfields_covered — all MoraleCalibration fields consumed
3. test_side_override_fields_covered — all SideCalibration fields consumed
4. test_all_fields_exercised — each field set by at least one scenario or test
5. test_dead_key_list_minimal — _DEAD_KEYS contains only advance_speed
6. test_schema_round_trip — CalibrationSchema serializes/deserializes cleanly
7. test_no_orphan_cal_get_keys — no cal.get() call uses a key not in the schema
8. test_calibration_defaults_match_original — defaults match pre-schema hardcoded values
57c: Zero-Deficit Audit¶
All items in devlog/index.md Post-MVP Refinement Index given formal dispositions:
Newly marked as resolved (8 items): - Phase 7: Pre-scripted behavior -> resolved Phase 8 (OODA FSM) - Phase 10: Fire rate limiting (inherited) -> resolved Phase 11a - Phase 10: Wave attack modeling (inherited) -> resolved Phase 11c - Phase 28.5: No dew_config scenario -> resolved (taiwan_strait uses dew_config) - Phase 37: 8 legacy scenarios -> resolved Phase 55b - Phase 37: DEW always destroys -> resolved Phase 51c (dew_disable_threshold) - Phase 42: Suwalki/Taiwan stall -> resolved Phase 55a - Phase 47: Naval phantom references -> resolved Phase 51a
Additional resolved items (from 57d/57e): - Checkpoint pickle serialization -> resolved Phase 57e (JSON migration) - Test coverage for checkpoint -> resolved Phase 57e (10 new tests) - Naval campaign resolution -> resolved Phase 57d (OPERATIONAL movement fix) - Weather evolution -> resolved (Markov weather already evolves per-tick) - Campaign EW integration -> resolved (EW engines called in campaign loop) - Cover double-count concern -> resolved (verified single application)
Marked as accepted limitations (34 items), grouped by category: - Architectural design decisions (12): Environment threading, single-threaded loop, simplified strategic movement, fixed reinforcement schedule, wave assignments manual, etc. - Domain-specific limitations (12): No DRFM, simplified TDOA, no cooperative jamming, Keplerian orbits, no satellite bus, debris statistical, convoy abstract, etc. - Combat model nuance (4): HEAT range-independent, carrier ops abstract, VLS/mine scenario, DEW ADUnitType routing - UI/cosmetic (5): Frame data redundancy, TEXT blobs, keyboard shortcuts, Plotly dark template, dark terrain palette - AI design decisions (1): COA analytical
Dormant capabilities (1): - Stratagems opportunity-evaluated (working as designed, may evolve to COA-planned in future)
57d: OPERATIONAL Resolution Deadlock Fix¶
Bug discovered: The Phase 55a closing range guard created a deadlock for forces between 15-30km apart:
- At STRATEGIC resolution, _forces_within_closing_range() detected forces within 30km and switched to OPERATIONAL
- At OPERATIONAL resolution, no battles were created (forces not close enough for tactical engagement)
- Without battles, no tactical movement occurred — forces were frozen
- update_strategic() was only called at STRATEGIC resolution, so units couldn't advance to close the gap
Fix: In stochastic_warfare/simulation/engine.py, run update_strategic() at OPERATIONAL resolution when no active battles exist. This allows forces to continue advancing toward each other under OPERATIONAL tick timing (300s) until they are close enough for TACTICAL engagement.
6 scenario recalibrations following the fix (outcomes changed due to forces now properly closing):
- bekaa_valley_1982 — calibration adjusted
- gulf_war_ew_1991 — calibration adjusted
- falklands_naval — calibration adjusted
- korean_peninsula — calibration adjusted
- 73_easting — calibration adjusted
- taiwan_strait — calibration adjusted
57e: Checkpoint Migration (Pickle to JSON)¶
Motivation: Pickle serialization is a security risk (arbitrary code execution on deserialize) and is fragile across Python version upgrades.
Changes to stochastic_warfare/core/checkpoint.py:
- New NumpyEncoder class — handles np.ndarray, np.integer, np.floating, np.bool_ for JSON serialization
- save_checkpoint() now writes JSON instead of pickle
- load_checkpoint() reads JSON by default, with legacy pickle fallback (detects binary vs JSON)
- All numpy types round-trip correctly through JSON
10 new tests in tests/unit/test_checkpoint.py:
- JSON save/load round-trip
- NumpyEncoder handles arrays, integers, floats, booleans
- Legacy pickle files still loadable (fallback path)
- Checkpoint with complex nested state
- Checkpoint overwrite behavior
- Error handling for corrupt files
57f: Phase 55 Test Update¶
- Korean Peninsula scenario ROE changed from
WEAPONS_TIGHTtoWEAPONS_FREEintests/unit/test_phase55_resolution_scenarios.py
Files Changed¶
| File | Change |
|---|---|
stochastic_warfare/simulation/engine.py |
OPERATIONAL resolution movement fix |
stochastic_warfare/core/checkpoint.py |
JSON serialization with NumpyEncoder, legacy pickle fallback |
tests/validation/test_historical_accuracy.py |
Module-scoped eval fixture, victory conditions, scenario coverage |
tests/validation/test_calibration_coverage.py |
New — 8 calibration coverage tests |
tests/validation/test_deficit_closure.py |
New — 7 structural verification tests |
tests/unit/test_checkpoint.py |
10 new checkpoint tests |
tests/unit/test_phase55_resolution_scenarios.py |
Korean Peninsula ROE fix |
data/scenarios/73_easting/scenario.yaml |
Victory conditions + recalibration |
data/scenarios/bekaa_valley_1982/scenario.yaml |
Victory conditions + recalibration |
data/scenarios/gulf_war_ew_1991/scenario.yaml |
Victory conditions + recalibration |
data/scenarios/falklands_naval/scenario.yaml |
Victory conditions + recalibration |
data/scenarios/korean_peninsula/scenario.yaml |
Recalibration |
data/scenarios/taiwan_strait/scenario.yaml |
Recalibration |
docs/devlog/index.md |
Deficit updates |
Design Decisions¶
- 80% MC threshold is achievable: Historical calibration across Phases 47-56 produced reliable outcomes. 80% at N=10 provides strong statistical confidence.
- Accepted limitation annotations inline: Each limitation annotated directly in the devlog/index.md table rather than in a separate file, keeping the single source of truth.
- Structural verification over integration tests: Deficit closure tests verify code paths exist (string search) rather than running full scenarios, keeping tests fast.
- falklands_campaign tracked as blue winner: The scenario has
time_expiredwithside: blue— if the British survive the campaign duration, they win. - OPERATIONAL movement fix is minimal: Rather than rearchitecting resolution switching, the fix adds
update_strategic()at OPERATIONAL when no battles are active — single conditional, zero side effects for TACTICAL/STRATEGIC. - JSON over pickle for checkpoints: Security (no arbitrary code execution), portability (human-readable, cross-version), debuggability. Legacy pickle fallback ensures backward compatibility.
- Module-scoped eval fixture: Running the full scenario evaluator is expensive; sharing the result across test classes in the same module avoids redundant computation.
Deficit Resolution¶
8 previously-resolved items marked with strikethrough + phase citation in devlog/index.md. 6 additional items resolved during 57d/57e work. 34 remaining items given "accepted limitation" disposition with rationale. 1 item marked as dormant capability.
Final deficit inventory: 169 total items, 134 resolved, 34 accepted limitations, 1 dormant capability, 0 unresolved.
Test Summary¶
| Substep | Tests | Description |
|---|---|---|
| 57a | 17 | Victory conditions (13), coverage (2), YAML load (1), falklands_campaign (1) |
| 57b | 8 | Calibration field coverage, consumers, round-trip, orphan keys |
| 57c/d (deficit closure) | 7 | DEW, naval, fire rate, closing range, schema typed |
| 57d (engine fix) | 9 | Scenario recalibrations, OPERATIONAL movement tests |
| 57e (checkpoint) | 10 | JSON round-trip, NumpyEncoder, legacy fallback |
| Total | 51 |
Lessons Learned¶
- Deficit disposition is fast when categories are clear: Architectural decisions, domain limitations, and cosmetic items each have obvious rationale. The audit took minutes, not hours.
- CalibrationSchema with extra="forbid" is the best defense against drift: The
test_no_orphan_cal_get_keystest catches any code using keys not in the schema — prevents the silent-failure pattern that plagued pre-Phase-49 calibration. - Structural tests catch regressions cheaply: Verifying that code paths exist via string search is 100x faster than running the scenario evaluator, and catches the same class of wiring bugs.
- Resolution guard bugs hide behind distance: The OPERATIONAL deadlock only manifested when forces started 15-30km apart — close enough to trigger OPERATIONAL but too far for TACTICAL engagement. Scenarios with forces closer or farther apart worked fine.
- Pickle to JSON migration is worth the effort: The NumpyEncoder handles all edge cases (arrays, integer subtypes, boolean). Legacy fallback means zero migration cost for existing checkpoints.
- Module-scoped fixtures save significant test time: The scenario evaluator takes seconds per scenario; sharing across test classes in one module eliminates redundant runs.
Postmortem¶
1. Delivered vs Planned¶
- Planned: 57a (scenario eval), 57b (calibration coverage), 57c (zero-deficit audit), 57d (doc sync). ~25 tests.
- Delivered: All planned + 57d (OPERATIONAL resolution deadlock fix), 57e (checkpoint JSON migration), 57f (Phase 55 test fix). 51 tests.
- Scope: Over — two significant unplanned items (engine deadlock, checkpoint migration) were the right call. Deadlock was a showstopper; checkpoint migration was user-requested security hardening.
2. Integration Audit¶
- All new code wired: NumpyEncoder used by create_checkpoint(), object_hook used by restore_checkpoint(), OPERATIONAL fix in engine tick loop
- No dead modules or orphan imports
- Verdict: Fully wired
3. Test Quality Review¶
- Integration: test_historical_accuracy.py runs full scenario evaluation subprocess (true E2E)
- Structural: test_deficit_closure.py verifies code paths via string search (fast, appropriate for existence checks)
- Unit: checkpoint tests cover NumpyEncoder in isolation + full round-trip with PRNG continuity
- Edge cases: legacy pickle fallback, numpy type handling, corrupt file paths
- Slow tests properly marked @pytest.mark.slow
- Verdict: High quality
4. API Surface Check¶
- checkpoint.py: all public functions typed, private prefix on internal helpers, get_logger(name)
- engine.py: 2-line addition within existing private method, no new API surface
- Verdict: Clean
5. Deficit Discovery¶
- 0 new deficits
- Two calibration fields (wind_accuracy_penalty_scale, rain_attenuation_factor) consumed but not exercised by scenarios — tracked in _CONSUMED_BUT_UNEXERCISED_FIELDS, not a test gap
6. Documentation Freshness¶
- Test counts updated to 8,383 Python / 8,655 total across CLAUDE.md, README.md, docs/index.md, MEMORY.md, phase-57.md
- Phase 57 in mkdocs.yml nav (line 140)
- Phase 57 in devlog/index.md table (line 73)
- All deficit dispositions inline in devlog/index.md
7. Performance Sanity¶
- Full suite: 820s (13:40) — ~2.5% slower than Phase 56 (~800s), well within 10% threshold
- No heavy new tests added
8. Summary¶
- Scope: Over (51 tests vs 25 planned, 2 unplanned engine/security fixes)
- Quality: High (structural + integration + unit mix, good edge case coverage)
- Integration: Fully wired (no dead code)
- Deficits: 0 new items
- Final state: 8,383 Python + 272 frontend = 8,655 total tests, 37 scenarios, Block 6 COMPLETE