Skip to content

Stochastic Warfare -- Block 9 Development Phases (83--91)

Philosophy

Block 9 is the performance at scale block. The engine produces historically validated results for scenarios up to ~300 units (Golan Heights: 120s). Real-world operational scenarios involve thousands of units. The tick loop is single-threaded, detection scales quadratically, and there is no level-of-detail system. Every future development direction — more scenarios, larger campaigns, real-time visualization, interactive play — depends on the engine handling 1,000+ units at reasonable speed.

This block measures first, then attacks the dominant O(n²) bottleneck (FOW detection), reduces effective unit count via LOD and aggregation, optimizes engagement selection and calibration lookups, expands JIT compilation, introduces SoA data for vectorization, adds per-side parallelism, and validates everything at scale.

Performance targets:

Scale Target Current (projected)
1,000 units, 6h scenario <5 min ~45 min
5,000 units, 6h scenario <30 min ~18 hr
10,000 units, 6h scenario <2 hr infeasible

Exit criteria: 1. Golan Heights (290 units) under 60s (from 120s) 2. New 1,000-unit benchmark scenario completes in <5 min 3. New 5,000-unit benchmark scenario completes in <30 min 4. All 44 existing scenarios produce correct winners (recalibrated if needed) 5. Deterministic reproducibility preserved (same seed = same result) 6. Profiling CI catches regressions (>20% slowdown fails build) 7. All existing tests pass (no behavioral regressions)

Cross-document alignment: This document must stay synchronized with brainstorm-block9.md (design thinking), devlog/index.md (phase status), and specs/project-structure.md (module definitions). Run /cross-doc-audit after any structural change.

No new simulation capabilities: Block 9 makes existing capabilities faster. No new combat models, eras, or subsystems. All changes are internal optimizations gated behind enable_* flags where they affect behavior.


Phase 83: Profiling Infrastructure

Status: Complete.

Goal: Establish measurement infrastructure before optimizing. Automated benchmarks, baseline tracking, regression detection, flame graph generation.

Dependencies: Block 8 complete (Phase 82).

83a: Benchmark Suite

Formalize the existing ad-hoc performance tests into a structured benchmark suite with JSON baseline tracking.

  • tests/benchmarks/benchmark_suite.py (new) -- Structured benchmark runner:
  • BenchmarkResult dataclass: scenario name, unit count, wall_clock_s, ticks_executed, ticks_per_second, peak_memory_mb, hotspots (top 20 by cumulative time)
  • BenchmarkBaseline class: loads/saves JSON baselines (tests/benchmarks/baselines.json)
  • run_benchmark(scenario_path, seed=42) -> BenchmarkResult: cProfile + tracemalloc wrapper
  • Regression check: result.wall_clock_s > baseline * 1.2 → FAIL
  • tests/benchmarks/baselines.json (new) -- Baseline results per scenario:
  • 73_easting: wall_clock_s, ticks, memory
  • golan_heights: wall_clock_s, ticks, memory
  • Format: {"scenario": {"wall_clock_s": float, "ticks_executed": int, "peak_memory_mb": float, "commit": str}}
  • tests/benchmarks/test_benchmarks.py (new) -- Parametrized benchmark tests:
  • @pytest.mark.benchmark marker (excluded by default, run via pytest -m benchmark)
  • 73 Easting benchmark (<30s assertion + baseline comparison)
  • Golan Heights benchmark (<120s assertion + baseline comparison)
  • Determinism verification (same seed = same winner + casualties)

Tests (~6): - Benchmark suite runs and produces BenchmarkResult - Baseline JSON loads/saves correctly - Regression detection triggers on >20% slowdown - Determinism check verifies identical results across runs

83b: Profiling Tooling

Extend the existing /profile skill with flame graph support and structured hotspot reporting.

  • stochastic_warfare/tools/profiling.py (modified) -- Add:
  • generate_hotspot_report(result: BenchmarkResult) -> str: Formatted top-20 hotspots with % of total time
  • save_flame_graph(scenario_path, output_path): Optional py-spy integration (requires py-spy installed)
  • compare_profiles(before: BenchmarkResult, after: BenchmarkResult) -> str: Side-by-side comparison

Tests (~4): - Hotspot report formatting - Profile comparison output - Flame graph path generation (no assertion on py-spy availability)

83c: CI Benchmark Workflow

Add a GitHub Actions workflow for automated performance regression detection.

  • .github/workflows/benchmark.yml (new) -- Benchmark CI:
  • Trigger: push to main, PR to main
  • Steps: checkout, uv sync, run 73 Easting benchmark, compare against baselines.json
  • Fail if >20% regression
  • Upload benchmark results as artifact
  • Golan Heights benchmark on workflow_dispatch only (too slow for every PR)

Tests (~2): - Workflow YAML is valid - Baseline file exists and is parseable

Exit Criteria

  • Benchmark suite runs and produces structured results
  • Baseline JSON established for 73 Easting and Golan Heights
  • CI workflow detects >20% regressions
  • /profile skill generates hotspot reports

Phase 84: Spatial Culling & Scan Scheduling

Status: Complete.

Goal: Address the #1 bottleneck — O(n²) FOW detection — with STRtree range culling and sensor scan intervals. Target: 10-30x detection speedup.

Dependencies: Phase 83 (profiling baseline established).

84a: STRtree Detection Culling

Build a per-tick spatial index of unit positions. Cull FOW detection loop to only check targets within sensor max range.

  • stochastic_warfare/detection/fog_of_war.py (modified) -- In detection sweep:
  • At start of tick: build STRtree from all unit positions (one tree per side, or one global tree with side filtering)
  • For each sensor: query tree for units within sensor.max_range_m radius
  • Pass only matching targets to check_detection() instead of all enemies
  • Gate behind enable_detection_culling: bool = True CalibrationSchema flag (default True — safe optimization, no behavioral change for targets outside sensor range)
  • stochastic_warfare/simulation/calibration.py (modified) -- Add enable_detection_culling: bool = True

Tests (~12): - Unit outside sensor range is not checked (verify reduced call count) - Unit inside sensor range is still detected (no false negatives) - STRtree build + query faster than brute-force at 100, 500, 1000 units - Edge case: unit exactly at max range boundary (include, not exclude) - Edge case: sensor with unlimited range (no culling applied) - Determinism: identical results with and without culling for all existing scenarios - Performance: Golan Heights tick time reduced (measure, don't assert specific %)

84b: Sensor Scan Scheduling

Stagger sensor scans across ticks based on scan interval. Not every sensor scans every tick.

  • data/sensors/*.yaml + data/eras/*/sensors/*.yaml (modified, 33 files) -- Add scan_interval_ticks: int field to sensor YAML:
  • Radar: 2-4 ticks (rotating antenna, 5-12s scan period at 5s/tick)
  • Visual: 1 tick (continuous observation)
  • Thermal/IR: 1-2 ticks (near-continuous)
  • Sonar: 3-5 ticks (acoustic integration time)
  • Default: 1 (backward compatible — every tick)
  • stochastic_warfare/detection/sensors.py (modified) -- Add scan_interval_ticks to SensorDefinition model (Field(default=1, ge=1))
  • stochastic_warfare/detection/fog_of_war.py (modified) -- In detection sweep:
  • Before scanning with a sensor: check current_tick % sensor.scan_interval_ticks == offset
  • offset derived from hash(sensor_id) % scan_interval_ticks to distribute scans evenly
  • Last detection result persists until next scan (no "forgetting" between scans)
  • Gate behind enable_scan_scheduling: bool = False CalibrationSchema flag (default False — opt-in since this changes detection timing)
  • stochastic_warfare/simulation/calibration.py (modified) -- Add enable_scan_scheduling: bool = False

Tests (~10): - Radar sensor with interval=3 only scans on ticks 0, 3, 6, 9, ... - Visual sensor with interval=1 scans every tick (backward compat) - Detection result persists between scans (target remains "detected" until next scan says otherwise) - Scan offset distributes multiple radars evenly (not all on same tick) - enable_scan_scheduling=False → all sensors scan every tick (backward compat) - Performance: detection checks reduced by ~50-67% with typical sensor mix

84c: Engagement Candidate Culling

Use the same spatial index from 84a to pre-filter engagement targets in the battle loop.

  • stochastic_warfare/simulation/battle.py (modified) -- In _execute_engagements():
  • Before threat scoring: query spatial index for enemies within attacker's max_weapon_range_m
  • Score only candidate set (not all enemies)
  • Reuse per-tick STRtree built in 84a (pass via context or battle manager attribute)
  • No new flag (pure optimization — scoring subset of enemies produces identical best-target selection since out-of-range enemies would score 0)

Tests (~8): - Same target selected with and without culling (determinism) - Candidate set size matches manual range filter - Zero-candidate case: no enemies in range → skip engagement (already handled) - Performance: threat scoring time reduced proportionally to candidate reduction

Exit Criteria

  • Golan Heights benchmark <90s (from 120s baseline — conservative 25% improvement)
  • 73 Easting benchmark unchanged (small scenario, no benefit from culling)
  • All 44 scenarios produce identical results (culling is transparent)
  • Profiling shows detection phase dropped from ~70% to <30% of tick time

Phase 85: LOD & Aggregation

Status: Complete.

Goal: Reduce effective unit count by classifying units into resolution tiers and activating the existing aggregation engine. Target: 5-6x tick reduction for 1,000+ unit scenarios.

Dependencies: Phase 84 (spatial index available for tier classification).

85a: Unit Resolution Tiers

Classify units each tick into Active/Nearby/Distant tiers with different update frequencies.

  • stochastic_warfare/simulation/battle.py (modified) -- Add LOD tier system:
  • _classify_unit_tier(unit, enemy_positions, spatial_index) -> Tier using spatial index from Phase 84:
    • ACTIVE: in engagement or within 2× max weapon range of any enemy
    • NEARBY: within max sensor range of any enemy but not ACTIVE
    • DISTANT: beyond any sensor range of any enemy
  • Tier update frequency: ACTIVE=every tick, NEARBY=every 5 ticks, DISTANT=every 20 ticks
  • On non-update ticks for NEARBY/DISTANT: skip detection, morale, logistics; run movement only
  • Hysteresis: unit must be in new tier for 3 consecutive ticks before reclassification (prevents flickering)
  • Instant promotion: any unit that takes damage or detects a new contact → immediately ACTIVE
  • Gate behind enable_lod: bool = False CalibrationSchema flag
  • stochastic_warfare/simulation/calibration.py (modified) -- Add:
  • enable_lod: bool = False
  • lod_nearby_interval: int = 5 — tick interval for NEARBY tier
  • lod_distant_interval: int = 20 — tick interval for DISTANT tier
  • lod_hysteresis_ticks: int = 3 — ticks before tier downgrade

Tests (~14): - Unit in engagement classified as ACTIVE - Unit far from all enemies classified as DISTANT - DISTANT unit only updated every 20 ticks (verify skip) - NEARBY unit only updated every 5 ticks - Hysteresis prevents single-tick flickering - Unit that takes damage instantly promoted to ACTIVE - enable_lod=False → all units processed every tick (backward compat) - Tier boundaries correct at 2× weapon range (ACTIVE) and sensor range (NEARBY) - Performance: 1000-unit scenario with LOD vs without (measure improvement)

85b: Aggregation Activation

Fix order preservation in ForceAggregationEngine and activate it.

  • stochastic_warfare/simulation/aggregation.py (modified) -- Fix disaggregation:
  • Before aggregation: snapshot each unit's current order in _pre_aggregation_orders: dict[str, Order | None]
  • On disaggregation: restore orders from snapshot
  • Units that were idle (no order) before aggregation remain idle after
  • Clear snapshot after successful disaggregation
  • stochastic_warfare/simulation/engine.py (modified) -- Wire aggregation into campaign tick:
  • When enable_aggregation=True: aggregate distant units (reuse LOD tier from 85a — DISTANT units aggregate)
  • Disaggregate when aggregate enters NEARBY range
  • Existing enable_aggregation flag (Phase 13, default False)

Tests (~10): - Aggregation preserves unit orders (snapshot/restore roundtrip) - Idle units remain idle after disaggregation - Aggregate moves at weighted average speed of component units - Disaggregation triggers when aggregate enters NEARBY range (from LOD spatial index) - Aggregate supply consumption matches sum of component unit rates - Aggregate detection signature is sum of component signatures (larger = easier to detect) - enable_aggregation=False → no aggregation (backward compat)

85c: LOD + Aggregation Integration

Verify the compound effect of LOD and aggregation together.

  • Tests (~6):
  • 1000-unit scenario: measure effective unit count with LOD only vs LOD+aggregation
  • Distant aggregate of 50 units updated every 20 ticks (compound effect: 50 units → 1 entity × 1/20 frequency = 1000x reduction)
  • Aggregation respects LOD tier transitions (disaggregate → ACTIVE, don't skip to DISTANT)
  • Full scenario: results within acceptable tolerance of non-LOD results (not identical due to update frequency changes, but same winner)

Exit Criteria

  • 1000-unit benchmark with LOD+aggregation: effective processing load <200 units/tick
  • Golan Heights: <80s (LOD has limited effect at 290 units since most are engaged)
  • All existing scenarios correct with enable_lod=False (default, backward compat)
  • Order preservation roundtrip verified

Phase 86: Engagement & Calibration Optimization

Status: Complete.

Goal: Optimize engagement selection and CalibrationSchema access patterns. Low-effort, low-risk improvements.

Dependencies: Phase 84 (spatial index for candidate culling already done in 84c).

86a: CalibrationSchema Flat Dict

Pre-compute a flat lookup dict at scenario load time for O(1) calibration access.

  • stochastic_warfare/simulation/calibration.py (modified) -- Add:
  • CalibrationSchema.to_flat_dict(sides: list[str]) -> dict[str, Any]: expand all fields including side-prefixed variants into flat dict
  • Side-prefixed keys: "{side}_{field}" for hit_probability_modifier, force_ratio_modifier, cohesion, target_size_modifier
  • Called once at scenario load time in ScenarioLoader
  • stochastic_warfare/simulation/battle.py (modified) -- Replace cal.get("key", default) with cal_flat["key"]:
  • ~100 replacements in engagement loop, movement loop, morale loop
  • Side-prefixed keys now resolved at lookup time (cal_flat[f"{side}_force_ratio_modifier"])
  • Preserve cal.get() API for backward compat in external callers (flat dict is internal optimization)
  • stochastic_warfare/simulation/scenario.py (modified) -- Generate flat dict at load time:
  • ctx.cal_flat = cal_schema.to_flat_dict(side_names) on SimulationContext

Tests (~8): - Flat dict contains all 125+ fields - Side-prefixed keys generated correctly for both sides - cal_flat["enable_fuel_consumption"] matches cal.enable_fuel_consumption - Flat dict is immutable after creation (dict, not defaultdict) - Performance: measure cal_flat["key"] vs cal.get("key", default) × 10K lookups - Full scenario: identical results with flat dict vs pydantic access

86b: Detection Modifier Batching

Batch the 20+ detection range modifiers into a single pre-computed multiplier per unit.

  • stochastic_warfare/simulation/battle.py (modified) -- In engagement loop:
  • Pre-compute _detection_modifier: dict[str, float] per unit at start of tick:
    • Weather visibility factor
    • Night/thermal factor
    • Concealment factor (terrain-dependent, already per-target)
    • MOPP factor
    • Icing factor
    • Naval posture factor
    • Obscurant spectral factor
  • During engagement: multiply sensor range by pre-computed modifier instead of evaluating each check inline
  • Concealment remains per-target (depends on target position), all other modifiers are per-observer

Tests (~6): - Pre-computed modifier matches inline computation for all modifier types - Per-target concealment still computed inline (not pre-computed) - Identical results for all existing scenarios - Performance: engagement modifier cascade time reduced

Exit Criteria

  • CalibrationSchema flat dict generates correctly for all scenarios
  • Detection modifier batching produces identical results
  • Measurable tick time reduction (profile before/after)
  • All existing tests pass

Phase 87: Expanded Numba JIT

Status: Complete.

Goal: JIT-compile detection SNR computation, engagement resolution math, and morale state transitions. Target: 5-10x speedup on JIT-able paths.

Dependencies: Phase 86 (flat dict provides simple data types for Numba compatibility).

87a: Detection SNR Kernels

JIT-compile the SNR computation functions for all sensor types.

  • stochastic_warfare/detection/detection.py (modified) -- Add @optional_jit to:
  • compute_snr_visual(signal, noise, range_m, ...) -> float
  • compute_snr_thermal(signal, noise, range_m, ...) -> float
  • compute_snr_radar(rcs, power, range_m, ...) -> float
  • compute_snr_acoustic(sl, tl, nl, ...) -> float
  • All are pure scalar math — ideal Numba targets
  • Ensure function signatures use only primitive types (float64, int64, bool)
  • stochastic_warfare/detection/fog_of_war.py (modified) -- Add vectorized detection sweep:
  • _batch_snr_check(observer_pos, target_positions, sensor_params, ...) -> np.ndarray[bool]
  • Numba @guvectorize or prange over target array
  • Returns boolean mask of detected targets (replaces per-target Python loop)

Tests (~8): - JIT SNR matches Python SNR for all 4 sensor types (value equality within float tolerance) - Batch detection produces same results as per-target loop - Performance: batch detection 5-10x faster than loop at 500+ targets - Graceful fallback when Numba not installed

87b: Engagement Math Kernels

JIT-compile engagement resolution math (hit probability, penetration, damage).

  • stochastic_warfare/combat/damage.py (modified) -- Add @optional_jit to:
  • compute_hit_probability(range_m, accuracy, modifiers, ...) -> float
  • compute_penetration(velocity, caliber, armor, obliquity, ...) -> float
  • compute_damage_fraction(penetration, armor, ...) -> float
  • stochastic_warfare/combat/ballistics.py (already JIT) -- Verify existing RK4 kernel coverage

Tests (~6): - JIT hit probability matches Python computation - JIT penetration matches DeMarre formula - Performance: engagement resolution 3-5x faster per engagement - Graceful fallback

87c: Morale State Machine Kernel

JIT-compile the continuous-time Markov morale transition computation.

  • stochastic_warfare/morale/state.py (modified) -- Add @optional_jit to:
  • compute_transition_rates(current_state, stress, cohesion, ...) -> np.ndarray
  • _evaluate_transition(rates, dt, rng_value) -> int — returns new state ordinal
  • Batch version: _batch_morale_update(states, stresses, cohesions, dt, rng_values) -> np.ndarray

Tests (~6): - JIT transition matches Python transition for all 5 morale states - Batch morale update produces same results as per-unit loop - Performance: morale phase 3-5x faster at 500+ units - Graceful fallback

Exit Criteria

  • All JIT kernels produce identical results to Python equivalents
  • Numba available: measurable speedup on profiled paths
  • Numba not available: zero behavioral change (fallback works)
  • All existing tests pass with and without Numba

Phase 88: SoA Data Layer

Status: Complete.

Goal: Introduce Structure-of-Arrays for hot-path unit data. Prerequisite for vectorized bulk operations and Numba prange parallelism.

Dependencies: Phase 87 (JIT kernels ready to consume array data).

88a: UnitArrays Core

Create the SoA data structure and sync protocol.

  • stochastic_warfare/simulation/unit_arrays.py (new) -- UnitArrays class:
  • Fields: positions (n,2), health (n,), ammo (n,), fuel (n,), morale_state (n,) int8, side (n,) int8, operational (n,) bool, max_range (n,), unit_ids (n,) str array
  • from_units(units: list[Unit]) -> UnitArrays: build arrays from Unit objects (start-of-tick sync)
  • sync_to_units(units: list[Unit]): write array values back to Unit objects (end-of-tick sync)
  • filter_by_side(side: int) -> tuple[UnitArrays, np.ndarray]: return filtered view + original indices
  • filter_operational() -> tuple[UnitArrays, np.ndarray]: exclude non-operational units
  • stochastic_warfare/simulation/battle.py (modified) -- Build UnitArrays at start of execute_tick():
  • Replace enemy_pos_arrays dict with UnitArrays (superset of existing Phase 70 position arrays)
  • Vectorized distance matrix: cdist(blue.positions, red.positions) (scipy) or broadcast
  • Gate behind enable_soa: bool = False CalibrationSchema flag

Tests (~12): - Round-trip sync: Unit → UnitArrays → Unit produces identical state - Positions array matches Unit.position.easting/northing - Side filtering produces correct subsets - Operational filtering excludes destroyed units - Distance matrix matches per-pair computation - Performance: vectorized distance 10x+ faster than Python loop at 500 units - enable_soa=False → existing behavior (backward compat)

88b: SoA Detection Integration

Use UnitArrays in the FOW detection loop for vectorized range checks.

  • stochastic_warfare/detection/fog_of_war.py (modified) -- When UnitArrays available:
  • Vectorized range check: np.linalg.norm(observer_pos - targets.positions, axis=1) < max_range (single numpy op)
  • Combined with STRtree culling (Phase 84): STRtree filters to ~100 candidates, then vectorized SNR across candidates
  • Integration with Numba batch kernels (Phase 87): pass array slices to JIT functions

Tests (~6): - Vectorized range check matches per-target check - SoA detection produces identical detections to non-SoA path - Performance: detection phase with SoA + culling + JIT combined

88c: SoA Movement & Morale Integration

Extend UnitArrays usage to movement and morale phases.

  • stochastic_warfare/simulation/battle.py (modified) -- Movement phase:
  • Vectorized position updates: positions += velocity * dt for all units in one operation
  • Fuel consumption: fuel -= distance * rate vectorized across all units
  • Sync back to Unit objects after movement phase
  • stochastic_warfare/simulation/battle.py (modified) -- Morale phase:
  • Batch morale kernel from Phase 87c consumes morale_state and stress arrays directly
  • Sync back morale states to Unit objects after morale phase

Tests (~8): - Vectorized movement matches per-unit movement (position + fuel) - Vectorized morale matches per-unit morale transitions - Full scenario: identical results with and without SoA - Performance: movement + morale phases measurably faster

Exit Criteria

  • UnitArrays round-trip sync verified
  • SoA integrated into detection, movement, and morale phases
  • All existing scenarios produce identical results with enable_soa=False
  • Measurable speedup at 500+ units

Phase 89: Per-Side Parallelism

Status: Complete.

Goal: Thread-based parallelism for detection and movement phases. Each side's detection/movement is independent until engagement resolution.

Dependencies: Phase 88 (SoA data layer enables independent per-side array operations).

89a: Per-Side Detection Threading

Run blue-side and red-side detection sweeps in parallel threads.

  • stochastic_warfare/simulation/battle.py (modified) -- In detection phase:
  • Split UnitArrays by side
  • Submit blue detection + red detection to ThreadPoolExecutor(max_workers=2)
  • Each thread: build side-specific STRtree, run JIT detection sweep on its own UnitArrays slice
  • GIL released during numpy/Numba operations → true parallelism for the vectorized paths
  • Join results before engagement phase
  • Gate behind enable_parallel_detection: bool = False CalibrationSchema flag
  • PRNG determinism: Each side uses its own pre-spawned RNG stream (already separate via ModuleId). Thread scheduling order doesn't affect PRNG sequences because each side consumes its own stream.

Tests (~8): - Parallel detection produces identical results to sequential (determinism) - Both sides' detections complete before engagement phase begins - PRNG streams are independent (no cross-contamination) - Performance: detection phase ~1.5-1.8x faster (not 2x due to GIL contention on Python overhead) - enable_parallel_detection=False → sequential (backward compat) - Thread safety: no shared mutable state between detection threads

89b: Per-Side Movement Threading

Run blue-side and red-side movement in parallel.

  • stochastic_warfare/simulation/battle.py (modified) -- In movement phase:
  • Split UnitArrays by side
  • Submit blue movement + red movement to ThreadPoolExecutor(max_workers=2)
  • Each thread: apply vectorized position + fuel updates on its own UnitArrays slice
  • Join and sync back to Unit objects before detection phase
  • Gate behind enable_parallel_movement: bool = False CalibrationSchema flag

Tests (~6): - Parallel movement produces identical results to sequential - PRNG determinism preserved - Performance: movement phase ~1.5x faster - enable_parallel_movement=False → sequential (backward compat)

89c: Engagement Resolution (Sequential)

Engagement resolution remains sequential for determinism. Document why and verify.

  • Tests (~4):
  • Engagement order is deterministic (sorted by unit_id or position)
  • First-mover advantage is consistent across runs with same seed
  • Engagement results identical regardless of parallel detection/movement flags
  • Full scenario: all 44 scenarios produce correct results with all parallel flags enabled

Exit Criteria

  • Per-side parallelism produces identical results to sequential
  • PRNG determinism preserved across all parallel configurations
  • Detection + movement each ~1.5x faster with parallelism
  • All existing scenarios correct with parallel flags enabled

Phase 90: Validation & Benchmarking

Status: Complete.

Goal: Create large-scale benchmark scenarios (1,000 and 5,000 units), validate performance targets, establish baselines for the new scale.

Dependencies: Phase 89 (all optimizations in place).

90a: Large-Scale Benchmark Scenarios

Create two new scenarios designed for performance testing at scale.

  • data/scenarios/benchmark_battalion/scenario.yaml (new) -- 1,000-unit battalion engagement:
  • Blue: 500 units (mixed armor/infantry/artillery/air defense)
  • Red: 500 units (mixed armor/infantry/artillery)
  • Terrain: 20km × 20km rolling terrain
  • Duration: 6 hours
  • Calibration: enable_detection_culling: true, enable_scan_scheduling: true, enable_lod: true, enable_soa: true
  • Expected outcome: decisive combat (not time_expired)
  • data/scenarios/benchmark_brigade/scenario.yaml (new) -- 5,000-unit brigade engagement:
  • Blue: 2,500 units (full combined arms brigade with logistics tail)
  • Red: 2,500 units (mechanized brigade)
  • Terrain: 50km × 50km
  • Duration: 6 hours
  • Calibration: all performance flags enabled + enable_aggregation: true
  • Expected outcome: decisive combat
  • Unit YAML: Reuse existing modern unit types with count multipliers (no new unit definitions needed)

Tests (~6): - Both scenarios load and validate against pydantic schema - Battalion scenario completes (seed=42, any outcome acceptable for first run) - Brigade scenario completes (seed=42, any outcome acceptable) - Victory condition triggered (not max_ticks safety limit)

90b: Performance Target Validation

Run benchmarks and verify performance targets from brainstorm.

  • tests/benchmarks/test_benchmarks.py (modified) -- Add:
  • Battalion benchmark: <5 min assertion (@pytest.mark.benchmark)
  • Brigade benchmark: <30 min assertion (@pytest.mark.benchmark)
  • Golan Heights regression: <60s assertion (improved from 120s baseline)
  • Profile hotspot comparison: before/after for each optimization phase
  • tests/benchmarks/baselines.json (modified) -- Add battalion and brigade baselines

Tests (~4): - Battalion <5 min - Brigade <30 min - Golan Heights <60s - 73 Easting <15s (should be faster too)

90c: Optimization Flag Impact Matrix

Measure the individual and combined impact of each optimization flag.

  • tests/benchmarks/test_flag_impact.py (new) -- Parametrized tests:
  • Run Golan Heights with each flag individually enabled, then all combined
  • Flags: enable_detection_culling, enable_scan_scheduling, enable_lod, enable_soa, enable_parallel_detection, enable_parallel_movement, enable_aggregation
  • Record wall_clock_s for each combination
  • Generate impact matrix (which flags help most, any negative interactions)

Tests (~8): - Individual flag impact measured for each of 7 flags - Combined flag impact measured - No negative interactions (no flag makes things slower)

Exit Criteria

  • Battalion scenario <5 min
  • Brigade scenario <30 min
  • Golan Heights <60s
  • Impact matrix shows which optimizations contribute most
  • Baselines established for new scenarios

Phase 91: Scenario Recalibration & Regression

Status: Complete.

Goal: Full recalibration pass across all 44+ scenarios. Verify that performance optimizations haven't shifted outcomes, recalibrate where they have, and validate large-scale scenarios produce militarily plausible results.

Dependencies: Phase 90 (all benchmarks established, performance targets met).

91a: Behavioral Impact Assessment

Run all 44 scenarios with and without performance flags to identify outcome shifts.

  • tests/validation/test_block9_regression.py (new) -- For each scenario:
  • Run with all performance flags OFF (baseline behavior)
  • Run with all performance flags ON
  • Compare: winner, victory condition type, tick count, casualty counts
  • Flag any scenario where winner changes or victory type changes
  • Expected: spatial culling (84a) and engagement culling (84c) are transparent (identical results)
  • Expected: scan scheduling (84b) and LOD (85a) may shift timing-sensitive outcomes

Tests (~44): - One parametrized test per scenario (44 scenarios × 2 configurations) - Winner comparison: PASS if same, FLAG if different

91b: Timing-Sensitive Scenario Recalibration

For scenarios where scan scheduling or LOD shifts outcomes, recalibrate.

  • Likely candidates (based on brainstorm analysis):
  • 73 Easting: Thermal detection timing is decisive — scan interval changes could shift first-detection advantage
  • Golan Heights: Defensive timing (who detects whom first at long range) — scan scheduling may shift initial contact timing
  • Falklands Naval: Missile exchange windows are narrow — scan latency could change Exocet detection timing
  • Bekaa Valley: SEAD timing against IADS — radar scan interval directly affects engagement sequence
  • Process: For each flagged scenario:
  • Run MC at 10+ seeds with performance flags ON
  • If correct winner rate drops below 80%: adjust calibration_overrides (CEV, hit modifiers, morale rates)
  • If correct winner rate is 80%+: accept (within statistical noise)
  • Document all recalibrations in devlog

Tests (~13 decisive + variable): - All 13 decisive combat scenarios produce correct winner at 80%+ MC rate - Recalibrated scenarios documented with rationale - Non-decisive scenarios (time_expired) maintain plausible composite scores

91c: Large-Scale Scenario Validation

Verify battalion and brigade benchmark scenarios produce militarily plausible outcomes.

  • Process:
  • Run MC at 5+ seeds for each benchmark scenario
  • Verify: engagements occur (not just movement), casualties are non-trivial, victory condition triggers before max_ticks
  • Verify: force ratio outcomes align with Lanchester expectations (larger/better-equipped side wins)
  • Adjust calibration if outcomes are implausible (e.g., 5000-unit battle with 0 casualties)

Tests (~6): - Battalion scenario: non-zero casualties on both sides - Battalion scenario: decisive victory (not time_expired or max_ticks) - Brigade scenario: non-zero casualties on both sides - Brigade scenario: decisive victory - Both scenarios: winner consistent across 5 seeds (>60% same winner)

91d: Documentation & Lockstep

Update all living documents for Block 9 completion.

  • Files: CLAUDE.md, README.md, docs/index.md, docs/devlog/index.md, mkdocs.yml, MEMORY.md
  • Phase devlog: docs/devlog/phase-91.md with Block 9 retrospective
  • Run /cross-doc-audit to verify all 19 checks pass

Exit Criteria

  • All 44 existing scenarios produce correct winners (recalibrated where needed)
  • All 13 decisive scenarios at 80%+ MC correctness
  • Battalion and brigade scenarios produce plausible outcomes
  • All documentation updated
  • Cross-doc audit passes (19/19)
  • Block 9 COMPLETE

Phase Summary

Phase Focus Tests Cumulative Status
83 Profiling Infrastructure 13 ~10,003 Complete
84 Spatial Culling & Scan Scheduling 31 ~10,034 Complete
85 LOD & Aggregation 30 ~10,064 Complete
86 Engagement & Calibration Optimization 19 ~10,083 Complete
87 Expanded Numba JIT 40 ~10,176 Complete
88 SoA Data Layer 43 ~10,219 Complete
89 Per-Side Parallelism 21 ~10,240 Complete
90 Validation & Benchmarking ~25 ~10,265 Complete
91 Scenario Recalibration & Regression ~58 ~10,322 Complete

Block 9 total: ~279 new tests across 9 completed phases. Cumulative: ~10,322 Python tests + ~316 frontend vitest = ~10,638 total. Block 9 COMPLETE.


Module Index: Block 9 Contributions

Module Phases Changes
detection/fog_of_war.py 84, 87, 88, 89 STRtree culling, scan scheduling, vectorized detection, SoA integration, rng param
detection/detection.py 87, 89 JIT SNR kernels (4 sensor types), rng param for parallel detection
simulation/battle.py 84, 85, 86, 88, 89 Engagement culling, LOD tiers, flat cal dict, modifier batching, SoA sync, per-side threads
simulation/calibration.py 84, 85, 86, 88, 89 enable_detection_culling, enable_scan_scheduling, enable_lod, enable_soa, enable_parallel_detection, flat dict API
simulation/unit_arrays.py 88 New: SoA data structure with sync protocol
simulation/aggregation.py 85 Order preservation fix, LOD-triggered activation
simulation/engine.py 85 Aggregation wiring in campaign tick
simulation/scenario.py 86 Flat cal dict generation at load time
detection/detection.py 87 JIT SNR kernels (4 sensor types)
combat/damage.py 87 JIT hit probability, penetration, damage
morale/state.py 87 JIT morale transition kernel
entities/equipment.py 84 scan_interval_ticks on sensor model
data/sensors/*.yaml 84 scan_interval_ticks field (~16 files)
tools/profiling.py 83 Flame graph, hotspot reports, profile comparison
tests/benchmarks/ 83, 90 Benchmark suite, baselines, flag impact matrix, battalion/brigade benchmarks
.github/workflows/ 83 benchmark.yml
data/scenarios/benchmark_*/ 90 Battalion (1K) and brigade (5K) scenarios

Risk Assessment

Risk Severity Mitigation
STRtree rebuild cost at 5,000+ units Medium Measure in Phase 84; fall back to grid spatial hash if >10ms/tick
Scan scheduling shifts detection timing High enable_scan_scheduling=False default; recalibrate in Phase 91
LOD tier misclassification (ambush missed) High Instant promotion on damage/contact; hysteresis prevents over-eager downgrade
SoA sync bugs (two representations of same data) High Explicit sync points (start/end of tick); round-trip tests
Numba compilation overhead on first call Low cache=True on all JIT decorators; amortized over scenario
Per-side threading breaks determinism High Independent PRNG streams per side; sequential engagement resolution; extensive determinism tests
Aggregation order loss on disaggregation Medium Phase 85b explicitly snapshots/restores orders; tested
Large-scale scenarios produce implausible outcomes Medium Phase 91c validates with MC; recalibrate if needed
GIL limits threading benefit Medium Focus parallel work on numpy/Numba ops which release GIL; expect 1.5x not 2x