Phase 83: Profiling Infrastructure¶
Block: 9 (Performance at Scale) Status: Complete Tests: 15 (4 benchmark infra + 3 benchmark 73E + 2 benchmark Golan + 6 profiling tools — 9 benchmark-marked, excludable)
Overview¶
Establishes measurement infrastructure before any Block 9 optimizations. Automated benchmarks, JSON baseline tracking, regression detection, profiling tooling, and a CI workflow for catching performance regressions.
What Was Built¶
83a: Benchmark Suite (tests/benchmarks/benchmark_suite.py)¶
BenchmarkResult(frozen dataclass): scenario_name, unit_count, wall_clock_s, ticks_executed, ticks_per_second, peak_memory_mb, hotspots, seed, winner, commitBaselineEntry(frozen dataclass): wall_clock_s, ticks_executed, peak_memory_mb, commit, timestampBenchmarkBaselineclass: load/save/update/check_regression againstbaselines.json- Regression check:
result.wall_clock_s > baseline * (1 + margin)where margin defaults to 0.2 (20%) run_benchmark(): Follows_run_scenario()pattern fromtest_battle_perf.pyprofile=True: wraps with cProfile + tracemalloc (adds ~10x overhead on Windows)profile=False: wall clock only (used by regression tests)- Commit hash from
git rev-parse --short HEADwith$GITHUB_SHAfallback
83b: Profiling Tooling (stochastic_warfare/tools/profiling.py)¶
generate_hotspot_report(): Formatted top-N hotspot table with cumulative time, % of total, call countcompare_profiles(): Side-by-side metric comparison with FASTER/SLOWER indicatorssave_flame_graph(): Optional py-spy integration (returns None if py-spy not installed)
83c: CI Benchmark Workflow (.github/workflows/benchmark.yml)¶
- Triggers: push to main, PR to main, manual workflow_dispatch
- Runs 73 Easting benchmark on every push/PR
- Golan Heights benchmark on manual dispatch only (too slow for every PR)
- Infrastructure tests (fast, no scenario runs)
- Uses
--override-ini="addopts="to clear default marker exclusions
83d: Baselines (tests/benchmarks/baselines.json)¶
Measured locally (Windows, profile=False): - 73 Easting: 6.65s wall clock, 185 ticks, 71 units → baseline 8.0s - Golan Heights: 433s wall clock, 6480 ticks, 290 units → baseline 500s
Files Changed¶
New files (7):
- tests/benchmarks/benchmark_suite.py — core benchmark module
- tests/benchmarks/baselines.json — baseline values
- tests/benchmarks/test_benchmarks.py — benchmark tests (9 tests, all @pytest.mark.benchmark)
- stochastic_warfare/tools/profiling.py — hotspot reports, profile comparison, flame graphs
- tests/tools/__init__.py — test package init
- tests/tools/test_profiling.py — profiling tool tests (6 tests)
- .github/workflows/benchmark.yml — CI benchmark workflow
Modified files (7):
- tests/validation/test_historical_accuracy.py — marked evaluator classes @pytest.mark.slow (CI timeout fix)
- tests/validation/test_phase_67_block7_validation.py — marked evaluator classes @pytest.mark.slow (same pattern)
- docs/development-phases-block9.md — Phase 83 status → Complete
- docs/devlog/index.md — Phase 83 row + Block 9 header
- mkdocs.yml — Phase 83 nav entry
- CLAUDE.md — Phase 83 row, Block 9 status
- README.md — Phase 83 row, Block 9 IN PROGRESS
- docs/index.md — Block 9 row
Key Decisions¶
profile=Falsefor regression tests: cProfile adds ~10x overhead on Windows (6.65s → 64.68s for 73 Easting). Regression tests use wall clock only; profiling is separate.- 20% regression margin: Accommodates ±15% CI runner variance while catching real regressions.
- Golan Heights is manual-only in CI: 433s locally, would time out frequently in CI. Available via
workflow_dispatchwithrun_golan=true. - Baselines from local measurement: Will need updating from CI environment for accurate ubuntu-latest baselines.
Lessons Learned¶
- cProfile overhead is enormous on Windows: 10x slowdown, not the 30-50% cited in literature.
profile=Falsemode essential for regression checks. tracemallocmisses C allocations: Peak memory reads 0.0 MB without profiling since numpy allocations are in C. Only meaningful withprofile=True.- Falsy default trap:
hotspots or [defaults]treats[]as falsy, using defaults instead. Must useif hotspots is Nonepattern.
Known Limitations¶
- Peak memory via tracemalloc is a lower bound (misses numpy/C allocations)
- Baselines are from Windows; CI (ubuntu-latest) will have different absolute values
- Flame graph support requires py-spy to be installed externally
Postmortem¶
Scope: On target (slightly over — 15 tests vs planned ~12, plus unplanned CI fix)
Quality: High — edge cases covered, both synthetic and real-scenario tests, proper marker discipline
Integration: Fully wired — all new files imported by tests, CI workflow references test file, baselines validated
Deficits: 0 new (known limitations documented above are accepted)
Unplanned: Marked test_historical_accuracy.py evaluator classes as @pytest.mark.slow — the full-evaluator subprocess exceeded 900s timeout on CI after Phase 80/81 enabled more enable_* flags on scenarios
Action items: None — all exit criteria met