Phase 83: Profiling Infrastructure¶

Block: 9 (Performance at Scale) Status: Complete Tests: 15 (4 benchmark infra + 3 benchmark 73E + 2 benchmark Golan + 6 profiling tools — 9 benchmark-marked, excludable)

Overview¶

Establishes measurement infrastructure before any Block 9 optimizations. Automated benchmarks, JSON baseline tracking, regression detection, profiling tooling, and a CI workflow for catching performance regressions.

What Was Built¶

83a: Benchmark Suite (`tests/benchmarks/benchmark_suite.py`)¶

BenchmarkResult (frozen dataclass): scenario_name, unit_count, wall_clock_s, ticks_executed, ticks_per_second, peak_memory_mb, hotspots, seed, winner, commit
BaselineEntry (frozen dataclass): wall_clock_s, ticks_executed, peak_memory_mb, commit, timestamp
BenchmarkBaseline class: load/save/update/check_regression against baselines.json
Regression check: result.wall_clock_s > baseline * (1 + margin) where margin defaults to 0.2 (20%)
run_benchmark(): Follows _run_scenario() pattern from test_battle_perf.py
profile=True: wraps with cProfile + tracemalloc (adds ~10x overhead on Windows)
profile=False: wall clock only (used by regression tests)
Commit hash from git rev-parse --short HEAD with $GITHUB_SHA fallback

83b: Profiling Tooling (`stochastic_warfare/tools/profiling.py`)¶

generate_hotspot_report(): Formatted top-N hotspot table with cumulative time, % of total, call count
compare_profiles(): Side-by-side metric comparison with FASTER/SLOWER indicators
save_flame_graph(): Optional py-spy integration (returns None if py-spy not installed)

83c: CI Benchmark Workflow (`.github/workflows/benchmark.yml`)¶

Triggers: push to main, PR to main, manual workflow_dispatch
Runs 73 Easting benchmark on every push/PR
Golan Heights benchmark on manual dispatch only (too slow for every PR)
Infrastructure tests (fast, no scenario runs)
Uses --override-ini="addopts=" to clear default marker exclusions

83d: Baselines (`tests/benchmarks/baselines.json`)¶

Measured locally (Windows, profile=False): - 73 Easting: 6.65s wall clock, 185 ticks, 71 units → baseline 8.0s - Golan Heights: 433s wall clock, 6480 ticks, 290 units → baseline 500s

Files Changed¶

New files (7): - tests/benchmarks/benchmark_suite.py — core benchmark module - tests/benchmarks/baselines.json — baseline values - tests/benchmarks/test_benchmarks.py — benchmark tests (9 tests, all @pytest.mark.benchmark) - stochastic_warfare/tools/profiling.py — hotspot reports, profile comparison, flame graphs - tests/tools/__init__.py — test package init - tests/tools/test_profiling.py — profiling tool tests (6 tests) - .github/workflows/benchmark.yml — CI benchmark workflow

Modified files (7): - tests/validation/test_historical_accuracy.py — marked evaluator classes @pytest.mark.slow (CI timeout fix) - tests/validation/test_phase_67_block7_validation.py — marked evaluator classes @pytest.mark.slow (same pattern) - docs/development-phases-block9.md — Phase 83 status → Complete - docs/devlog/index.md — Phase 83 row + Block 9 header - mkdocs.yml — Phase 83 nav entry - CLAUDE.md — Phase 83 row, Block 9 status - README.md — Phase 83 row, Block 9 IN PROGRESS - docs/index.md — Block 9 row

Key Decisions¶

profile=False for regression tests: cProfile adds ~10x overhead on Windows (6.65s → 64.68s for 73 Easting). Regression tests use wall clock only; profiling is separate.
20% regression margin: Accommodates ±15% CI runner variance while catching real regressions.
Golan Heights is manual-only in CI: 433s locally, would time out frequently in CI. Available via workflow_dispatch with run_golan=true.
Baselines from local measurement: Will need updating from CI environment for accurate ubuntu-latest baselines.

Lessons Learned¶

cProfile overhead is enormous on Windows: 10x slowdown, not the 30-50% cited in literature. profile=False mode essential for regression checks.
tracemalloc misses C allocations: Peak memory reads 0.0 MB without profiling since numpy allocations are in C. Only meaningful with profile=True.
Falsy default trap: hotspots or [defaults] treats [] as falsy, using defaults instead. Must use if hotspots is None pattern.

Known Limitations¶

Peak memory via tracemalloc is a lower bound (misses numpy/C allocations)
Baselines are from Windows; CI (ubuntu-latest) will have different absolute values
Flame graph support requires py-spy to be installed externally

Postmortem¶

Scope: On target (slightly over — 15 tests vs planned ~12, plus unplanned CI fix) Quality: High — edge cases covered, both synthetic and real-scenario tests, proper marker discipline Integration: Fully wired — all new files imported by tests, CI workflow references test file, baselines validated Deficits: 0 new (known limitations documented above are accepted) Unplanned: Marked test_historical_accuracy.py evaluator classes as @pytest.mark.slow — the full-evaluator subprocess exceeded 900s timeout on CI after Phase 80/81 enabled more enable_* flags on scenarios Action items: None — all exit criteria met