Stochastic Warfare -- Block 9 Development Phases (83--91)¶
Philosophy¶
Block 9 is the performance at scale block. The engine produces historically validated results for scenarios up to ~300 units (Golan Heights: 120s). Real-world operational scenarios involve thousands of units. The tick loop is single-threaded, detection scales quadratically, and there is no level-of-detail system. Every future development direction — more scenarios, larger campaigns, real-time visualization, interactive play — depends on the engine handling 1,000+ units at reasonable speed.
This block measures first, then attacks the dominant O(n²) bottleneck (FOW detection), reduces effective unit count via LOD and aggregation, optimizes engagement selection and calibration lookups, expands JIT compilation, introduces SoA data for vectorization, adds per-side parallelism, and validates everything at scale.
Performance targets:
| Scale | Target | Current (projected) |
|---|---|---|
| 1,000 units, 6h scenario | <5 min | ~45 min |
| 5,000 units, 6h scenario | <30 min | ~18 hr |
| 10,000 units, 6h scenario | <2 hr | infeasible |
Exit criteria: 1. Golan Heights (290 units) under 60s (from 120s) 2. New 1,000-unit benchmark scenario completes in <5 min 3. New 5,000-unit benchmark scenario completes in <30 min 4. All 44 existing scenarios produce correct winners (recalibrated if needed) 5. Deterministic reproducibility preserved (same seed = same result) 6. Profiling CI catches regressions (>20% slowdown fails build) 7. All existing tests pass (no behavioral regressions)
Cross-document alignment: This document must stay synchronized with brainstorm-block9.md (design thinking), devlog/index.md (phase status), and specs/project-structure.md (module definitions). Run /cross-doc-audit after any structural change.
No new simulation capabilities: Block 9 makes existing capabilities faster. No new combat models, eras, or subsystems. All changes are internal optimizations gated behind enable_* flags where they affect behavior.
Phase 83: Profiling Infrastructure¶
Status: Complete.
Goal: Establish measurement infrastructure before optimizing. Automated benchmarks, baseline tracking, regression detection, flame graph generation.
Dependencies: Block 8 complete (Phase 82).
83a: Benchmark Suite¶
Formalize the existing ad-hoc performance tests into a structured benchmark suite with JSON baseline tracking.
tests/benchmarks/benchmark_suite.py(new) -- Structured benchmark runner:BenchmarkResultdataclass: scenario name, unit count, wall_clock_s, ticks_executed, ticks_per_second, peak_memory_mb, hotspots (top 20 by cumulative time)BenchmarkBaselineclass: loads/saves JSON baselines (tests/benchmarks/baselines.json)run_benchmark(scenario_path, seed=42) -> BenchmarkResult: cProfile + tracemalloc wrapper- Regression check:
result.wall_clock_s > baseline * 1.2→ FAIL tests/benchmarks/baselines.json(new) -- Baseline results per scenario:73_easting: wall_clock_s, ticks, memorygolan_heights: wall_clock_s, ticks, memory- Format:
{"scenario": {"wall_clock_s": float, "ticks_executed": int, "peak_memory_mb": float, "commit": str}} tests/benchmarks/test_benchmarks.py(new) -- Parametrized benchmark tests:@pytest.mark.benchmarkmarker (excluded by default, run viapytest -m benchmark)- 73 Easting benchmark (<30s assertion + baseline comparison)
- Golan Heights benchmark (<120s assertion + baseline comparison)
- Determinism verification (same seed = same winner + casualties)
Tests (~6): - Benchmark suite runs and produces BenchmarkResult - Baseline JSON loads/saves correctly - Regression detection triggers on >20% slowdown - Determinism check verifies identical results across runs
83b: Profiling Tooling¶
Extend the existing /profile skill with flame graph support and structured hotspot reporting.
stochastic_warfare/tools/profiling.py(modified) -- Add:generate_hotspot_report(result: BenchmarkResult) -> str: Formatted top-20 hotspots with % of total timesave_flame_graph(scenario_path, output_path): Optionalpy-spyintegration (requirespy-spyinstalled)compare_profiles(before: BenchmarkResult, after: BenchmarkResult) -> str: Side-by-side comparison
Tests (~4): - Hotspot report formatting - Profile comparison output - Flame graph path generation (no assertion on py-spy availability)
83c: CI Benchmark Workflow¶
Add a GitHub Actions workflow for automated performance regression detection.
.github/workflows/benchmark.yml(new) -- Benchmark CI:- Trigger: push to main, PR to main
- Steps: checkout, uv sync, run 73 Easting benchmark, compare against baselines.json
- Fail if >20% regression
- Upload benchmark results as artifact
- Golan Heights benchmark on
workflow_dispatchonly (too slow for every PR)
Tests (~2): - Workflow YAML is valid - Baseline file exists and is parseable
Exit Criteria¶
- Benchmark suite runs and produces structured results
- Baseline JSON established for 73 Easting and Golan Heights
- CI workflow detects >20% regressions
/profileskill generates hotspot reports
Phase 84: Spatial Culling & Scan Scheduling¶
Status: Complete.
Goal: Address the #1 bottleneck — O(n²) FOW detection — with STRtree range culling and sensor scan intervals. Target: 10-30x detection speedup.
Dependencies: Phase 83 (profiling baseline established).
84a: STRtree Detection Culling¶
Build a per-tick spatial index of unit positions. Cull FOW detection loop to only check targets within sensor max range.
stochastic_warfare/detection/fog_of_war.py(modified) -- In detection sweep:- At start of tick: build
STRtreefrom all unit positions (one tree per side, or one global tree with side filtering) - For each sensor: query tree for units within
sensor.max_range_mradius - Pass only matching targets to
check_detection()instead of all enemies - Gate behind
enable_detection_culling: bool = TrueCalibrationSchema flag (default True — safe optimization, no behavioral change for targets outside sensor range) stochastic_warfare/simulation/calibration.py(modified) -- Addenable_detection_culling: bool = True
Tests (~12): - Unit outside sensor range is not checked (verify reduced call count) - Unit inside sensor range is still detected (no false negatives) - STRtree build + query faster than brute-force at 100, 500, 1000 units - Edge case: unit exactly at max range boundary (include, not exclude) - Edge case: sensor with unlimited range (no culling applied) - Determinism: identical results with and without culling for all existing scenarios - Performance: Golan Heights tick time reduced (measure, don't assert specific %)
84b: Sensor Scan Scheduling¶
Stagger sensor scans across ticks based on scan interval. Not every sensor scans every tick.
data/sensors/*.yaml+data/eras/*/sensors/*.yaml(modified, 33 files) -- Addscan_interval_ticks: intfield to sensor YAML:- Radar: 2-4 ticks (rotating antenna, 5-12s scan period at 5s/tick)
- Visual: 1 tick (continuous observation)
- Thermal/IR: 1-2 ticks (near-continuous)
- Sonar: 3-5 ticks (acoustic integration time)
- Default: 1 (backward compatible — every tick)
stochastic_warfare/detection/sensors.py(modified) -- Addscan_interval_tickstoSensorDefinitionmodel (Field(default=1, ge=1))stochastic_warfare/detection/fog_of_war.py(modified) -- In detection sweep:- Before scanning with a sensor: check
current_tick % sensor.scan_interval_ticks == offset offsetderived fromhash(sensor_id) % scan_interval_ticksto distribute scans evenly- Last detection result persists until next scan (no "forgetting" between scans)
- Gate behind
enable_scan_scheduling: bool = FalseCalibrationSchema flag (default False — opt-in since this changes detection timing) stochastic_warfare/simulation/calibration.py(modified) -- Addenable_scan_scheduling: bool = False
Tests (~10):
- Radar sensor with interval=3 only scans on ticks 0, 3, 6, 9, ...
- Visual sensor with interval=1 scans every tick (backward compat)
- Detection result persists between scans (target remains "detected" until next scan says otherwise)
- Scan offset distributes multiple radars evenly (not all on same tick)
- enable_scan_scheduling=False → all sensors scan every tick (backward compat)
- Performance: detection checks reduced by ~50-67% with typical sensor mix
84c: Engagement Candidate Culling¶
Use the same spatial index from 84a to pre-filter engagement targets in the battle loop.
stochastic_warfare/simulation/battle.py(modified) -- In_execute_engagements():- Before threat scoring: query spatial index for enemies within attacker's
max_weapon_range_m - Score only candidate set (not all enemies)
- Reuse per-tick STRtree built in 84a (pass via context or battle manager attribute)
- No new flag (pure optimization — scoring subset of enemies produces identical best-target selection since out-of-range enemies would score 0)
Tests (~8): - Same target selected with and without culling (determinism) - Candidate set size matches manual range filter - Zero-candidate case: no enemies in range → skip engagement (already handled) - Performance: threat scoring time reduced proportionally to candidate reduction
Exit Criteria¶
- Golan Heights benchmark <90s (from 120s baseline — conservative 25% improvement)
- 73 Easting benchmark unchanged (small scenario, no benefit from culling)
- All 44 scenarios produce identical results (culling is transparent)
- Profiling shows detection phase dropped from ~70% to <30% of tick time
Phase 85: LOD & Aggregation¶
Status: Complete.
Goal: Reduce effective unit count by classifying units into resolution tiers and activating the existing aggregation engine. Target: 5-6x tick reduction for 1,000+ unit scenarios.
Dependencies: Phase 84 (spatial index available for tier classification).
85a: Unit Resolution Tiers¶
Classify units each tick into Active/Nearby/Distant tiers with different update frequencies.
stochastic_warfare/simulation/battle.py(modified) -- Add LOD tier system:_classify_unit_tier(unit, enemy_positions, spatial_index) -> Tierusing spatial index from Phase 84:ACTIVE: in engagement or within 2× max weapon range of any enemyNEARBY: within max sensor range of any enemy but not ACTIVEDISTANT: beyond any sensor range of any enemy
- Tier update frequency: ACTIVE=every tick, NEARBY=every 5 ticks, DISTANT=every 20 ticks
- On non-update ticks for NEARBY/DISTANT: skip detection, morale, logistics; run movement only
- Hysteresis: unit must be in new tier for 3 consecutive ticks before reclassification (prevents flickering)
- Instant promotion: any unit that takes damage or detects a new contact → immediately ACTIVE
- Gate behind
enable_lod: bool = FalseCalibrationSchema flag stochastic_warfare/simulation/calibration.py(modified) -- Add:enable_lod: bool = Falselod_nearby_interval: int = 5— tick interval for NEARBY tierlod_distant_interval: int = 20— tick interval for DISTANT tierlod_hysteresis_ticks: int = 3— ticks before tier downgrade
Tests (~14):
- Unit in engagement classified as ACTIVE
- Unit far from all enemies classified as DISTANT
- DISTANT unit only updated every 20 ticks (verify skip)
- NEARBY unit only updated every 5 ticks
- Hysteresis prevents single-tick flickering
- Unit that takes damage instantly promoted to ACTIVE
- enable_lod=False → all units processed every tick (backward compat)
- Tier boundaries correct at 2× weapon range (ACTIVE) and sensor range (NEARBY)
- Performance: 1000-unit scenario with LOD vs without (measure improvement)
85b: Aggregation Activation¶
Fix order preservation in ForceAggregationEngine and activate it.
stochastic_warfare/simulation/aggregation.py(modified) -- Fix disaggregation:- Before aggregation: snapshot each unit's current order in
_pre_aggregation_orders: dict[str, Order | None] - On disaggregation: restore orders from snapshot
- Units that were idle (no order) before aggregation remain idle after
- Clear snapshot after successful disaggregation
stochastic_warfare/simulation/engine.py(modified) -- Wire aggregation into campaign tick:- When
enable_aggregation=True: aggregate distant units (reuse LOD tier from 85a — DISTANT units aggregate) - Disaggregate when aggregate enters NEARBY range
- Existing
enable_aggregationflag (Phase 13, default False)
Tests (~10):
- Aggregation preserves unit orders (snapshot/restore roundtrip)
- Idle units remain idle after disaggregation
- Aggregate moves at weighted average speed of component units
- Disaggregation triggers when aggregate enters NEARBY range (from LOD spatial index)
- Aggregate supply consumption matches sum of component unit rates
- Aggregate detection signature is sum of component signatures (larger = easier to detect)
- enable_aggregation=False → no aggregation (backward compat)
85c: LOD + Aggregation Integration¶
Verify the compound effect of LOD and aggregation together.
- Tests (~6):
- 1000-unit scenario: measure effective unit count with LOD only vs LOD+aggregation
- Distant aggregate of 50 units updated every 20 ticks (compound effect: 50 units → 1 entity × 1/20 frequency = 1000x reduction)
- Aggregation respects LOD tier transitions (disaggregate → ACTIVE, don't skip to DISTANT)
- Full scenario: results within acceptable tolerance of non-LOD results (not identical due to update frequency changes, but same winner)
Exit Criteria¶
- 1000-unit benchmark with LOD+aggregation: effective processing load <200 units/tick
- Golan Heights: <80s (LOD has limited effect at 290 units since most are engaged)
- All existing scenarios correct with
enable_lod=False(default, backward compat) - Order preservation roundtrip verified
Phase 86: Engagement & Calibration Optimization¶
Status: Complete.
Goal: Optimize engagement selection and CalibrationSchema access patterns. Low-effort, low-risk improvements.
Dependencies: Phase 84 (spatial index for candidate culling already done in 84c).
86a: CalibrationSchema Flat Dict¶
Pre-compute a flat lookup dict at scenario load time for O(1) calibration access.
stochastic_warfare/simulation/calibration.py(modified) -- Add:CalibrationSchema.to_flat_dict(sides: list[str]) -> dict[str, Any]: expand all fields including side-prefixed variants into flat dict- Side-prefixed keys:
"{side}_{field}"forhit_probability_modifier,force_ratio_modifier,cohesion,target_size_modifier - Called once at scenario load time in
ScenarioLoader stochastic_warfare/simulation/battle.py(modified) -- Replacecal.get("key", default)withcal_flat["key"]:- ~100 replacements in engagement loop, movement loop, morale loop
- Side-prefixed keys now resolved at lookup time (
cal_flat[f"{side}_force_ratio_modifier"]) - Preserve
cal.get()API for backward compat in external callers (flat dict is internal optimization) stochastic_warfare/simulation/scenario.py(modified) -- Generate flat dict at load time:ctx.cal_flat = cal_schema.to_flat_dict(side_names)on SimulationContext
Tests (~8):
- Flat dict contains all 125+ fields
- Side-prefixed keys generated correctly for both sides
- cal_flat["enable_fuel_consumption"] matches cal.enable_fuel_consumption
- Flat dict is immutable after creation (dict, not defaultdict)
- Performance: measure cal_flat["key"] vs cal.get("key", default) × 10K lookups
- Full scenario: identical results with flat dict vs pydantic access
86b: Detection Modifier Batching¶
Batch the 20+ detection range modifiers into a single pre-computed multiplier per unit.
stochastic_warfare/simulation/battle.py(modified) -- In engagement loop:- Pre-compute
_detection_modifier: dict[str, float]per unit at start of tick:- Weather visibility factor
- Night/thermal factor
- Concealment factor (terrain-dependent, already per-target)
- MOPP factor
- Icing factor
- Naval posture factor
- Obscurant spectral factor
- During engagement: multiply sensor range by pre-computed modifier instead of evaluating each check inline
- Concealment remains per-target (depends on target position), all other modifiers are per-observer
Tests (~6): - Pre-computed modifier matches inline computation for all modifier types - Per-target concealment still computed inline (not pre-computed) - Identical results for all existing scenarios - Performance: engagement modifier cascade time reduced
Exit Criteria¶
- CalibrationSchema flat dict generates correctly for all scenarios
- Detection modifier batching produces identical results
- Measurable tick time reduction (profile before/after)
- All existing tests pass
Phase 87: Expanded Numba JIT¶
Status: Complete.
Goal: JIT-compile detection SNR computation, engagement resolution math, and morale state transitions. Target: 5-10x speedup on JIT-able paths.
Dependencies: Phase 86 (flat dict provides simple data types for Numba compatibility).
87a: Detection SNR Kernels¶
JIT-compile the SNR computation functions for all sensor types.
stochastic_warfare/detection/detection.py(modified) -- Add@optional_jitto:compute_snr_visual(signal, noise, range_m, ...) -> floatcompute_snr_thermal(signal, noise, range_m, ...) -> floatcompute_snr_radar(rcs, power, range_m, ...) -> floatcompute_snr_acoustic(sl, tl, nl, ...) -> float- All are pure scalar math — ideal Numba targets
- Ensure function signatures use only primitive types (float64, int64, bool)
stochastic_warfare/detection/fog_of_war.py(modified) -- Add vectorized detection sweep:_batch_snr_check(observer_pos, target_positions, sensor_params, ...) -> np.ndarray[bool]- Numba
@guvectorizeorprangeover target array - Returns boolean mask of detected targets (replaces per-target Python loop)
Tests (~8): - JIT SNR matches Python SNR for all 4 sensor types (value equality within float tolerance) - Batch detection produces same results as per-target loop - Performance: batch detection 5-10x faster than loop at 500+ targets - Graceful fallback when Numba not installed
87b: Engagement Math Kernels¶
JIT-compile engagement resolution math (hit probability, penetration, damage).
stochastic_warfare/combat/damage.py(modified) -- Add@optional_jitto:compute_hit_probability(range_m, accuracy, modifiers, ...) -> floatcompute_penetration(velocity, caliber, armor, obliquity, ...) -> floatcompute_damage_fraction(penetration, armor, ...) -> floatstochastic_warfare/combat/ballistics.py(already JIT) -- Verify existing RK4 kernel coverage
Tests (~6): - JIT hit probability matches Python computation - JIT penetration matches DeMarre formula - Performance: engagement resolution 3-5x faster per engagement - Graceful fallback
87c: Morale State Machine Kernel¶
JIT-compile the continuous-time Markov morale transition computation.
stochastic_warfare/morale/state.py(modified) -- Add@optional_jitto:compute_transition_rates(current_state, stress, cohesion, ...) -> np.ndarray_evaluate_transition(rates, dt, rng_value) -> int— returns new state ordinal- Batch version:
_batch_morale_update(states, stresses, cohesions, dt, rng_values) -> np.ndarray
Tests (~6): - JIT transition matches Python transition for all 5 morale states - Batch morale update produces same results as per-unit loop - Performance: morale phase 3-5x faster at 500+ units - Graceful fallback
Exit Criteria¶
- All JIT kernels produce identical results to Python equivalents
- Numba available: measurable speedup on profiled paths
- Numba not available: zero behavioral change (fallback works)
- All existing tests pass with and without Numba
Phase 88: SoA Data Layer¶
Status: Complete.
Goal: Introduce Structure-of-Arrays for hot-path unit data. Prerequisite for vectorized bulk operations and Numba prange parallelism.
Dependencies: Phase 87 (JIT kernels ready to consume array data).
88a: UnitArrays Core¶
Create the SoA data structure and sync protocol.
stochastic_warfare/simulation/unit_arrays.py(new) --UnitArraysclass:- Fields:
positions (n,2),health (n,),ammo (n,),fuel (n,),morale_state (n,) int8,side (n,) int8,operational (n,) bool,max_range (n,),unit_ids (n,) str array from_units(units: list[Unit]) -> UnitArrays: build arrays from Unit objects (start-of-tick sync)sync_to_units(units: list[Unit]): write array values back to Unit objects (end-of-tick sync)filter_by_side(side: int) -> tuple[UnitArrays, np.ndarray]: return filtered view + original indicesfilter_operational() -> tuple[UnitArrays, np.ndarray]: exclude non-operational unitsstochastic_warfare/simulation/battle.py(modified) -- Build UnitArrays at start ofexecute_tick():- Replace
enemy_pos_arraysdict with UnitArrays (superset of existing Phase 70 position arrays) - Vectorized distance matrix:
cdist(blue.positions, red.positions)(scipy) or broadcast - Gate behind
enable_soa: bool = FalseCalibrationSchema flag
Tests (~12):
- Round-trip sync: Unit → UnitArrays → Unit produces identical state
- Positions array matches Unit.position.easting/northing
- Side filtering produces correct subsets
- Operational filtering excludes destroyed units
- Distance matrix matches per-pair computation
- Performance: vectorized distance 10x+ faster than Python loop at 500 units
- enable_soa=False → existing behavior (backward compat)
88b: SoA Detection Integration¶
Use UnitArrays in the FOW detection loop for vectorized range checks.
stochastic_warfare/detection/fog_of_war.py(modified) -- When UnitArrays available:- Vectorized range check:
np.linalg.norm(observer_pos - targets.positions, axis=1) < max_range(single numpy op) - Combined with STRtree culling (Phase 84): STRtree filters to ~100 candidates, then vectorized SNR across candidates
- Integration with Numba batch kernels (Phase 87): pass array slices to JIT functions
Tests (~6): - Vectorized range check matches per-target check - SoA detection produces identical detections to non-SoA path - Performance: detection phase with SoA + culling + JIT combined
88c: SoA Movement & Morale Integration¶
Extend UnitArrays usage to movement and morale phases.
stochastic_warfare/simulation/battle.py(modified) -- Movement phase:- Vectorized position updates:
positions += velocity * dtfor all units in one operation - Fuel consumption:
fuel -= distance * ratevectorized across all units - Sync back to Unit objects after movement phase
stochastic_warfare/simulation/battle.py(modified) -- Morale phase:- Batch morale kernel from Phase 87c consumes
morale_stateandstressarrays directly - Sync back morale states to Unit objects after morale phase
Tests (~8): - Vectorized movement matches per-unit movement (position + fuel) - Vectorized morale matches per-unit morale transitions - Full scenario: identical results with and without SoA - Performance: movement + morale phases measurably faster
Exit Criteria¶
- UnitArrays round-trip sync verified
- SoA integrated into detection, movement, and morale phases
- All existing scenarios produce identical results with
enable_soa=False - Measurable speedup at 500+ units
Phase 89: Per-Side Parallelism¶
Status: Complete.
Goal: Thread-based parallelism for detection and movement phases. Each side's detection/movement is independent until engagement resolution.
Dependencies: Phase 88 (SoA data layer enables independent per-side array operations).
89a: Per-Side Detection Threading¶
Run blue-side and red-side detection sweeps in parallel threads.
stochastic_warfare/simulation/battle.py(modified) -- In detection phase:- Split UnitArrays by side
- Submit blue detection + red detection to
ThreadPoolExecutor(max_workers=2) - Each thread: build side-specific STRtree, run JIT detection sweep on its own UnitArrays slice
- GIL released during numpy/Numba operations → true parallelism for the vectorized paths
- Join results before engagement phase
- Gate behind
enable_parallel_detection: bool = FalseCalibrationSchema flag - PRNG determinism: Each side uses its own pre-spawned RNG stream (already separate via ModuleId). Thread scheduling order doesn't affect PRNG sequences because each side consumes its own stream.
Tests (~8):
- Parallel detection produces identical results to sequential (determinism)
- Both sides' detections complete before engagement phase begins
- PRNG streams are independent (no cross-contamination)
- Performance: detection phase ~1.5-1.8x faster (not 2x due to GIL contention on Python overhead)
- enable_parallel_detection=False → sequential (backward compat)
- Thread safety: no shared mutable state between detection threads
89b: Per-Side Movement Threading¶
Run blue-side and red-side movement in parallel.
stochastic_warfare/simulation/battle.py(modified) -- In movement phase:- Split UnitArrays by side
- Submit blue movement + red movement to
ThreadPoolExecutor(max_workers=2) - Each thread: apply vectorized position + fuel updates on its own UnitArrays slice
- Join and sync back to Unit objects before detection phase
- Gate behind
enable_parallel_movement: bool = FalseCalibrationSchema flag
Tests (~6):
- Parallel movement produces identical results to sequential
- PRNG determinism preserved
- Performance: movement phase ~1.5x faster
- enable_parallel_movement=False → sequential (backward compat)
89c: Engagement Resolution (Sequential)¶
Engagement resolution remains sequential for determinism. Document why and verify.
- Tests (~4):
- Engagement order is deterministic (sorted by unit_id or position)
- First-mover advantage is consistent across runs with same seed
- Engagement results identical regardless of parallel detection/movement flags
- Full scenario: all 44 scenarios produce correct results with all parallel flags enabled
Exit Criteria¶
- Per-side parallelism produces identical results to sequential
- PRNG determinism preserved across all parallel configurations
- Detection + movement each ~1.5x faster with parallelism
- All existing scenarios correct with parallel flags enabled
Phase 90: Validation & Benchmarking¶
Status: Complete.
Goal: Create large-scale benchmark scenarios (1,000 and 5,000 units), validate performance targets, establish baselines for the new scale.
Dependencies: Phase 89 (all optimizations in place).
90a: Large-Scale Benchmark Scenarios¶
Create two new scenarios designed for performance testing at scale.
data/scenarios/benchmark_battalion/scenario.yaml(new) -- 1,000-unit battalion engagement:- Blue: 500 units (mixed armor/infantry/artillery/air defense)
- Red: 500 units (mixed armor/infantry/artillery)
- Terrain: 20km × 20km rolling terrain
- Duration: 6 hours
- Calibration:
enable_detection_culling: true,enable_scan_scheduling: true,enable_lod: true,enable_soa: true - Expected outcome: decisive combat (not time_expired)
data/scenarios/benchmark_brigade/scenario.yaml(new) -- 5,000-unit brigade engagement:- Blue: 2,500 units (full combined arms brigade with logistics tail)
- Red: 2,500 units (mechanized brigade)
- Terrain: 50km × 50km
- Duration: 6 hours
- Calibration: all performance flags enabled +
enable_aggregation: true - Expected outcome: decisive combat
- Unit YAML: Reuse existing modern unit types with
countmultipliers (no new unit definitions needed)
Tests (~6): - Both scenarios load and validate against pydantic schema - Battalion scenario completes (seed=42, any outcome acceptable for first run) - Brigade scenario completes (seed=42, any outcome acceptable) - Victory condition triggered (not max_ticks safety limit)
90b: Performance Target Validation¶
Run benchmarks and verify performance targets from brainstorm.
tests/benchmarks/test_benchmarks.py(modified) -- Add:- Battalion benchmark: <5 min assertion (
@pytest.mark.benchmark) - Brigade benchmark: <30 min assertion (
@pytest.mark.benchmark) - Golan Heights regression: <60s assertion (improved from 120s baseline)
- Profile hotspot comparison: before/after for each optimization phase
tests/benchmarks/baselines.json(modified) -- Add battalion and brigade baselines
Tests (~4): - Battalion <5 min - Brigade <30 min - Golan Heights <60s - 73 Easting <15s (should be faster too)
90c: Optimization Flag Impact Matrix¶
Measure the individual and combined impact of each optimization flag.
tests/benchmarks/test_flag_impact.py(new) -- Parametrized tests:- Run Golan Heights with each flag individually enabled, then all combined
- Flags:
enable_detection_culling,enable_scan_scheduling,enable_lod,enable_soa,enable_parallel_detection,enable_parallel_movement,enable_aggregation - Record wall_clock_s for each combination
- Generate impact matrix (which flags help most, any negative interactions)
Tests (~8): - Individual flag impact measured for each of 7 flags - Combined flag impact measured - No negative interactions (no flag makes things slower)
Exit Criteria¶
- Battalion scenario <5 min
- Brigade scenario <30 min
- Golan Heights <60s
- Impact matrix shows which optimizations contribute most
- Baselines established for new scenarios
Phase 91: Scenario Recalibration & Regression¶
Status: Complete.
Goal: Full recalibration pass across all 44+ scenarios. Verify that performance optimizations haven't shifted outcomes, recalibrate where they have, and validate large-scale scenarios produce militarily plausible results.
Dependencies: Phase 90 (all benchmarks established, performance targets met).
91a: Behavioral Impact Assessment¶
Run all 44 scenarios with and without performance flags to identify outcome shifts.
tests/validation/test_block9_regression.py(new) -- For each scenario:- Run with all performance flags OFF (baseline behavior)
- Run with all performance flags ON
- Compare: winner, victory condition type, tick count, casualty counts
- Flag any scenario where winner changes or victory type changes
- Expected: spatial culling (84a) and engagement culling (84c) are transparent (identical results)
- Expected: scan scheduling (84b) and LOD (85a) may shift timing-sensitive outcomes
Tests (~44): - One parametrized test per scenario (44 scenarios × 2 configurations) - Winner comparison: PASS if same, FLAG if different
91b: Timing-Sensitive Scenario Recalibration¶
For scenarios where scan scheduling or LOD shifts outcomes, recalibrate.
- Likely candidates (based on brainstorm analysis):
- 73 Easting: Thermal detection timing is decisive — scan interval changes could shift first-detection advantage
- Golan Heights: Defensive timing (who detects whom first at long range) — scan scheduling may shift initial contact timing
- Falklands Naval: Missile exchange windows are narrow — scan latency could change Exocet detection timing
- Bekaa Valley: SEAD timing against IADS — radar scan interval directly affects engagement sequence
- Process: For each flagged scenario:
- Run MC at 10+ seeds with performance flags ON
- If correct winner rate drops below 80%: adjust
calibration_overrides(CEV, hit modifiers, morale rates) - If correct winner rate is 80%+: accept (within statistical noise)
- Document all recalibrations in devlog
Tests (~13 decisive + variable): - All 13 decisive combat scenarios produce correct winner at 80%+ MC rate - Recalibrated scenarios documented with rationale - Non-decisive scenarios (time_expired) maintain plausible composite scores
91c: Large-Scale Scenario Validation¶
Verify battalion and brigade benchmark scenarios produce militarily plausible outcomes.
- Process:
- Run MC at 5+ seeds for each benchmark scenario
- Verify: engagements occur (not just movement), casualties are non-trivial, victory condition triggers before max_ticks
- Verify: force ratio outcomes align with Lanchester expectations (larger/better-equipped side wins)
- Adjust calibration if outcomes are implausible (e.g., 5000-unit battle with 0 casualties)
Tests (~6): - Battalion scenario: non-zero casualties on both sides - Battalion scenario: decisive victory (not time_expired or max_ticks) - Brigade scenario: non-zero casualties on both sides - Brigade scenario: decisive victory - Both scenarios: winner consistent across 5 seeds (>60% same winner)
91d: Documentation & Lockstep¶
Update all living documents for Block 9 completion.
- Files: CLAUDE.md, README.md, docs/index.md, docs/devlog/index.md, mkdocs.yml, MEMORY.md
- Phase devlog:
docs/devlog/phase-91.mdwith Block 9 retrospective - Run
/cross-doc-auditto verify all 19 checks pass
Exit Criteria¶
- All 44 existing scenarios produce correct winners (recalibrated where needed)
- All 13 decisive scenarios at 80%+ MC correctness
- Battalion and brigade scenarios produce plausible outcomes
- All documentation updated
- Cross-doc audit passes (19/19)
- Block 9 COMPLETE
Phase Summary¶
| Phase | Focus | Tests | Cumulative | Status |
|---|---|---|---|---|
| 83 | Profiling Infrastructure | 13 | ~10,003 | Complete |
| 84 | Spatial Culling & Scan Scheduling | 31 | ~10,034 | Complete |
| 85 | LOD & Aggregation | 30 | ~10,064 | Complete |
| 86 | Engagement & Calibration Optimization | 19 | ~10,083 | Complete |
| 87 | Expanded Numba JIT | 40 | ~10,176 | Complete |
| 88 | SoA Data Layer | 43 | ~10,219 | Complete |
| 89 | Per-Side Parallelism | 21 | ~10,240 | Complete |
| 90 | Validation & Benchmarking | ~25 | ~10,265 | Complete |
| 91 | Scenario Recalibration & Regression | ~58 | ~10,322 | Complete |
Block 9 total: ~279 new tests across 9 completed phases. Cumulative: ~10,322 Python tests + ~316 frontend vitest = ~10,638 total. Block 9 COMPLETE.
Module Index: Block 9 Contributions¶
| Module | Phases | Changes |
|---|---|---|
detection/fog_of_war.py |
84, 87, 88, 89 | STRtree culling, scan scheduling, vectorized detection, SoA integration, rng param |
detection/detection.py |
87, 89 | JIT SNR kernels (4 sensor types), rng param for parallel detection |
simulation/battle.py |
84, 85, 86, 88, 89 | Engagement culling, LOD tiers, flat cal dict, modifier batching, SoA sync, per-side threads |
simulation/calibration.py |
84, 85, 86, 88, 89 | enable_detection_culling, enable_scan_scheduling, enable_lod, enable_soa, enable_parallel_detection, flat dict API |
simulation/unit_arrays.py |
88 | New: SoA data structure with sync protocol |
simulation/aggregation.py |
85 | Order preservation fix, LOD-triggered activation |
simulation/engine.py |
85 | Aggregation wiring in campaign tick |
simulation/scenario.py |
86 | Flat cal dict generation at load time |
detection/detection.py |
87 | JIT SNR kernels (4 sensor types) |
combat/damage.py |
87 | JIT hit probability, penetration, damage |
morale/state.py |
87 | JIT morale transition kernel |
entities/equipment.py |
84 | scan_interval_ticks on sensor model |
data/sensors/*.yaml |
84 | scan_interval_ticks field (~16 files) |
tools/profiling.py |
83 | Flame graph, hotspot reports, profile comparison |
tests/benchmarks/ |
83, 90 | Benchmark suite, baselines, flag impact matrix, battalion/brigade benchmarks |
.github/workflows/ |
83 | benchmark.yml |
data/scenarios/benchmark_*/ |
90 | Battalion (1K) and brigade (5K) scenarios |
Risk Assessment¶
| Risk | Severity | Mitigation |
|---|---|---|
| STRtree rebuild cost at 5,000+ units | Medium | Measure in Phase 84; fall back to grid spatial hash if >10ms/tick |
| Scan scheduling shifts detection timing | High | enable_scan_scheduling=False default; recalibrate in Phase 91 |
| LOD tier misclassification (ambush missed) | High | Instant promotion on damage/contact; hysteresis prevents over-eager downgrade |
| SoA sync bugs (two representations of same data) | High | Explicit sync points (start/end of tick); round-trip tests |
| Numba compilation overhead on first call | Low | cache=True on all JIT decorators; amortized over scenario |
| Per-side threading breaks determinism | High | Independent PRNG streams per side; sequential engagement resolution; extensive determinism tests |
| Aggregation order loss on disaggregation | Medium | Phase 85b explicitly snapshots/restores orders; tested |
| Large-scale scenarios produce implausible outcomes | Medium | Phase 91c validates with MC; recalibrate if needed |
| GIL limits threading benefit | Medium | Focus parallel work on numpy/Numba ops which release GIL; expect 1.5x not 2x |