Phase 10: Full Campaign Validation & Backtesting¶

Summary¶

Phase 10 validates the complete simulation engine at campaign scale against documented historical outcomes. Unlike Phase 7 (which validated individual engagements with pre-scripted behavior), Phase 10 runs multi-day campaigns with AI commanders, logistics, C2, reinforcements — all 11 domain modules interacting through the Phase 9 SimulationEngine.

Two historical campaigns validated: 1. Golan Heights (Oct 6-10, 1973) — 4-day land campaign with Israeli defense → reinforcement → counterattack 2. Falklands San Carlos (May 21-25, 1982) — 5-day naval air defense campaign with Argentine air raids

Test count: 196 new tests. Running total: 3,782 tests.

What Was Built¶

New Source Modules (5)¶

Module	Purpose
`validation/campaign_data.py`	`HistoricalCampaign` model + `AIExpectation` + `CampaignDataLoader`
`validation/campaign_runner.py`	`CampaignRunner` wrapping `ScenarioLoader` + `SimulationEngine`
`validation/campaign_metrics.py`	Campaign-level metric extraction (`CampaignValidationMetrics`)
`validation/ai_validation.py`	AI decision quality analysis (`AIDecisionValidator`)
`validation/performance.py`	cProfile + tracemalloc profiling (`PerformanceProfiler`)

Modified Source Files (2)¶

File	Change
`validation/monte_carlo.py`	Added `CampaignMonteCarloHarness` + `_run_campaign_iteration()`
`validation/__init__.py`	Updated docstring to cover campaign validation

New YAML Scenario Files (2)¶

File	Scenario
`data/scenarios/golan_campaign/scenario.yaml`	4-day Golan Heights campaign
`data/scenarios/falklands_campaign/scenario.yaml`	5-day Falklands San Carlos campaign

New Test Files (9)¶

File	Tests
`tests/validation/test_campaign_data.py`	33
`tests/validation/test_campaign_runner.py`	26
`tests/validation/test_campaign_metrics.py`	32
`tests/validation/test_campaign_mc.py`	10
`tests/validation/test_ai_validation.py`	31
`tests/validation/test_performance.py`	10
`tests/validation/test_golan_campaign.py`	24 (+5 slow)
`tests/validation/test_falklands_campaign.py`	21 (+2 slow)
`tests/integration/test_phase10_integration.py`	9
`tests/validation/test_campaign_performance.py`	0 (+10 slow)

Design Decisions¶

DD-1: HistoricalCampaign wraps CampaignScenarioConfig fields¶

Mirrors the Phase 7 pattern where HistoricalEngagement wraps engagement fields plus documented_outcomes. CampaignDataLoader.to_scenario_config() strips validation-only fields and produces a config suitable for ScenarioLoader.load().

DD-2: CampaignRunner wraps ScenarioLoader + SimulationEngine¶

Single run() call: converts HistoricalCampaign → temp YAML → ScenarioLoader.load() → SimulationEngine.run() → CampaignRunResult. All domain wiring stays in ScenarioLoader (DRY).

DD-3: Separate campaign_metrics.py¶

Campaign metrics (units destroyed, exchange ratio, campaign duration, territory control) are distinct from engagement metrics. Follows the same static-method design as EngagementMetrics.

DD-4: CampaignMonteCarloHarness extends monte_carlo.py¶

Reuses MonteCarloConfig, RunResult, MonteCarloResult, ComparisonReport. _run_campaign_iteration() is a top-level picklable function (same pattern as _run_single_iteration()).

DD-5: AI decision validation via recorder events¶

AIDecisionValidator scans RecordedEvent entries for AI event types (OODAPhaseChangeEvent, DecisionMadeEvent, etc.) and matches actions against expected postures with configurable tolerance.

DD-6: Two historical campaigns (Golan + Falklands)¶

Covers land and naval domains. Golan tests reinforcements, defense→offense transition, morale cascade. Falklands tests naval C2, air defense, multiple engagement waves.

DD-7: Wider tolerances for campaign-level comparison¶

Default tolerance_factor=3.0 for campaign documented_outcomes vs 2.0 for engagement-level.

Deviations from Plan¶

Test count came in at 196 vs planned ~290. The plan over-estimated per-module test counts; actual tests are more focused and avoid redundancy.
No separate test_campaign_performance.py non-slow tests — all performance tests are @pytest.mark.slow as they require real campaign runs.
The integration test test_reasonable_historical_passes needed adjustment: with max_ticks=10, campaign duration is ~3600s, requiring a historical value of 3600 (not 86400) for the tolerance test.

Severity	Limitation
MAJOR	No fire rate limiting — units fire once per tick regardless of ROF (inherited from Phase 7)
MAJOR	No wave attack modeling — all red units advance simultaneously (inherited from Phase 7)
MAJOR	Campaign AI decisions are coarse — OODA cycle operates at echelon timing scales, may not produce tactical-level posture changes within short MC runs
MINOR	Simplified force compositions — representative unit samples, not complete historical OOB
MINOR	Synthetic terrain — programmatic heightmaps, not real topographic data
MINOR	Fixed reinforcement schedule — deterministic arrival times, no stochastic variation
MINOR	No force aggregation/disaggregation — all units individually tracked (performance concern for large campaigns)
MINOR	AI expectation matching is approximate — posture detection based on action string matching, not deep behavioral analysis
MINOR	Campaign metrics proxy territory control via unit survival fraction rather than spatial objective control
COSMETIC	`_decide_brigade_div` hardcodes `echelon_level=9` in result (inherited from Phase 8)

Lessons Learned¶

CampaignRunner temp YAML pattern works well: Writing a temp YAML for ScenarioLoader avoids building a separate config→context path. The overhead is negligible.
Small tick limits (max_ticks=5-20) make fast unit tests possible: Campaign tests complete in <0.5s by limiting ticks while still exercising the full wiring.
Monte Carlo at campaign level reuses all engagement MC infrastructure: MonteCarloResult.compare_to_historical() works unchanged because both harnesses produce RunResult with dict[str, float] metrics.
AI decision extraction depends on event generation: With very few ticks, AI modules may not complete an OODA cycle, producing zero decisions. Tests must account for this.
Tolerance factor of 3.0 is essential for campaign-level comparison: Multi-day campaigns with AI commander variability and stochastic reinforcement effects produce wide outcome distributions.