Phase 14: Tooling & Developer Experience

Overview

Phase 14 adds developer tooling: a Claude Code MCP server, analysis utilities, visualization tools, and 6 new Claude skills. All new code lives in stochastic_warfare/tools/ — purely additive, no modifications to existing simulation code.

Test count: 125 new tests (4,372 total passing)

Sub-phases

14a: MCP Server (36 tests)

  • tools/__init__.py — Package init
  • tools/serializers.py — JSON serialization for numpy, datetime, enum, Position, inf/nan, dataclasses, pydantic models
  • tools/result_store.py — LRU cache (max 20) for run results with store/get/latest/list_runs/clear
  • tools/mcp_server.py — FastMCP server with 7 tools: run_scenario, query_state, run_monte_carlo, compare_results, list_scenarios, list_units, modify_parameter
  • tools/mcp_resources.py — 3 resource providers: scenario://{name}/config, unit://{category}/{type}, result://{run_id}
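
The result store described above can be sketched with an OrderedDict. This is a minimal illustration, not the actual contents of tools/result_store.py: the class name ResultStore, the method signatures, and the choice that latest() returns the most recently stored run (rather than the most recently read one) are assumptions. Only the method names and the limit of 20 come from the list above.

```python
from __future__ import annotations

from collections import OrderedDict
from typing import Any

_MAX_STORE_SIZE = 20  # the "max 20" limit mentioned above


class ResultStore:
    """Hypothetical LRU cache for run results, keyed by run ID."""

    def __init__(self, max_size: int = _MAX_STORE_SIZE) -> None:
        self._max_size = max_size
        self._runs: OrderedDict[str, dict[str, Any]] = OrderedDict()
        self._latest_id: str | None = None

    def store(self, run_id: str, result: dict[str, Any]) -> None:
        # Insert (or refresh) the run, then evict least recently used entries.
        self._runs[run_id] = result
        self._runs.move_to_end(run_id)
        self._latest_id = run_id
        while len(self._runs) > self._max_size:
            self._runs.popitem(last=False)

    def get(self, run_id: str) -> dict[str, Any] | None:
        # A successful read marks the run as recently used.
        if run_id not in self._runs:
            return None
        self._runs.move_to_end(run_id)
        return self._runs[run_id]

    def latest(self) -> dict[str, Any] | None:
        # Assumption: "latest" means the most recently stored run.
        if self._latest_id is None:
            return None
        return self._runs.get(self._latest_id)

    def list_runs(self) -> list[str]:
        # Run IDs ordered from least to most recently used.
        return list(self._runs)

    def clear(self) -> None:
        self._runs.clear()
        self._latest_id = None
```

Because eviction only happens inside store(), the run that was just stored always sits at the most-recent end and cannot be the one evicted.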

Key decisions:

  • asyncio.to_thread() for blocking simulation calls
  • All tools return JSON; errors return {"error": true, "error_type": "...", "message": "..."}
  • mcp[cli]>=1.2.0 as optional dependency (uv sync --extra mcp)
  • Console script entry point: stochastic-warfare-mcp
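
The first two decisions combine naturally in a single wrapper. The sketch below is an assumption about how such a wrapper might look, not the code in tools/mcp_server.py: call_blocking_tool, error_payload, and fake_run_scenario are hypothetical names, and the stand-in function replaces the real simulation call. Only asyncio.to_thread() and the error shape come from the decisions above.

```python
from __future__ import annotations

import asyncio
import json
import time
from typing import Any, Callable


def error_payload(exc: Exception) -> dict[str, Any]:
    # Error shape taken from the key decisions above.
    return {"error": True, "error_type": type(exc).__name__, "message": str(exc)}


async def call_blocking_tool(
    func: Callable[..., dict[str, Any]], *args: Any, **kwargs: Any
) -> str:
    """Run a blocking function in a worker thread and return a JSON string."""
    try:
        # to_thread keeps the event loop responsive while func blocks.
        result = await asyncio.to_thread(func, *args, **kwargs)
    except Exception as exc:
        return json.dumps(error_payload(exc))
    return json.dumps(result)


def fake_run_scenario(name: str, seed: int = 0) -> dict[str, Any]:
    # Hypothetical stand-in for a blocking simulation run.
    if not name:
        raise ValueError("scenario name is required")
    time.sleep(0.01)
    return {"scenario": name, "seed": seed, "ticks": 5}


if __name__ == "__main__":
    print(asyncio.run(call_blocking_tool(fake_run_scenario, "demo", seed=42)))
    print(asyncio.run(call_blocking_tool(fake_run_scenario, "")))
```

Catching the exception inside the wrapper means a failed run reaches the client as an ordinary JSON payload rather than as a transport-level failure.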

14b: Analysis Tools (63 tests)

  • tools/narrative.py — Registry-based template system with ~15 built-in formatters for event types. generate_narrative() groups events by tick, format_narrative() supports full/summary/timeline styles.
  • tools/tempo_analysis.py — FFT spectral analysis of event frequency by 5 categories (Combat, Detection, C2, Morale, Movement). OODA cycle timing extraction from OODAPhaseChangeEvent sequences. 3-panel plot (time series, FFT spectrum, OODA boxplot).
  • tools/comparison.py — A/B statistical comparison using Mann-Whitney U test with rank-biserial effect size. compare_distributions() for direct use, run_comparison() for full scenario-based comparison.
  • tools/sensitivity.py — Parameter sweep over calibration overrides. Same seed sequence at every point. Errorbar plot output.
  • tools/_run_helpers.py — Shared batch runner used by comparison and sensitivity modules. Temp YAML pattern from CampaignRunner.
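
To make the rank-biserial effect size used by the comparison module concrete, here is a dependency-free sketch that computes it from the Mann-Whitney U statistic by direct pairwise counting. It is illustrative only: the function names are hypothetical, the p-value is omitted, and tools/comparison.py itself relies on scipy.stats.mannwhitneyu rather than on this O(n1·n2) loop.

```python
from __future__ import annotations

from typing import Sequence


def mann_whitney_u(a: Sequence[float], b: Sequence[float]) -> float:
    """U statistic for sample a: pairs where a beats b, ties count one half."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


def rank_biserial(a: Sequence[float], b: Sequence[float]) -> float:
    """Effect size in [-1, 1]; positive means a tends to exceed b."""
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    u = mann_whitney_u(a, b)
    return 2.0 * u / (len(a) * len(b)) - 1.0


if __name__ == "__main__":
    # Hypothetical per-run metric values for two configurations.
    config_a = [12.0, 15.0, 14.0, 18.0]
    config_b = [9.0, 11.0, 10.0, 13.0]
    print(rank_biserial(config_a, config_b))
```

An effect size of 0 means neither sample tends to dominate, which is also what this formula yields when every value is identical.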

14c: Visualization (26 tests)

  • tools/charts.py — 6 chart functions: force_strength_chart, engagement_network, supply_flow_diagram, engagement_timeline, morale_progression, mc_distribution_grid. All return matplotlib.figure.Figure, no plt.show().
  • tools/replay.py — Animated battle replay via FuncAnimation. extract_replay_frames() from snapshot data, create_replay() with side-colored scatter plots and engagement lines, save_replay() to GIF/MP4.

Key decisions:

  • matplotlib.use("Agg") at module level to avoid Tk backend issues on headless/Windows
  • networkx graph for engagement network visualization (already a dependency)

14d: Claude Skills (no tests)

6 new skill files in .claude/skills/:

  • /scenario — Interactive scenario creation/editing walkthrough
  • /compare — Run two configs and summarize with statistical interpretation
  • /what-if — Quick parameter sensitivity from natural language questions
  • /timeline — Generate narrative from simulation run
  • /orbat — Interactive order of battle builder
  • /calibrate — Auto-tune calibration overrides to match historical metrics

pyproject.toml Changes

  • Added mcp = ["mcp[cli]>=1.2.0"] optional dependency group
  • Added stochastic-warfare-mcp console script entry point

Files Created

| File | Lines | Purpose |
|------|-------|---------|
| tools/__init__.py | 1 | Package init |
| tools/serializers.py | 92 | JSON serialization |
| tools/result_store.py | 80 | LRU result cache |
| tools/mcp_server.py | 310 | MCP server + 7 tools |
| tools/mcp_resources.py | 72 | MCP resource providers |
| tools/narrative.py | 240 | Battle narrative generation |
| tools/tempo_analysis.py | 270 | FFT tempo analysis |
| tools/comparison.py | 145 | A/B statistical comparison |
| tools/sensitivity.py | 130 | Parameter sweep |
| tools/_run_helpers.py | 165 | Shared batch runner |
| tools/charts.py | 230 | 6 chart functions |
| tools/replay.py | 220 | Animated replay |
| 7 skill SKILL.md files | ~150 each | Claude skill templates |
| 7 test files | 125 tests | Full test coverage |

Lessons Learned

  • IntEnum vs Enum serialization: IntEnum subclasses int, so the isinstance(obj, int) check fires before isinstance(obj, enum.Enum). Must check enum first.
  • matplotlib Agg backend: On Windows without proper Tk installation, matplotlib.pyplot fails to create figures. Setting matplotlib.use("Agg") at module level avoids the issue.
  • Mann-Whitney U with identical values: scipy.stats.mannwhitneyu raises ValueError when all values are identical. Must catch and return p=1.0.
  • Temp YAML pattern reuse: The CampaignRunner pattern of writing temp YAML for ScenarioLoader works well for parameter sweeps and comparisons.
  • No simulation code modified: Phase 14 is purely additive — all 4,247 existing tests continue to pass unchanged.
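
The IntEnum ordering issue from the first lesson is easy to reproduce in isolation. The sketch below is a hypothetical converter, not the code in tools/serializers.py: the Posture enum, the function names, and the choice to emit the member name are all assumptions made for illustration.

```python
from __future__ import annotations

import enum
from typing import Any


class Posture(enum.IntEnum):  # hypothetical enum, for illustration only
    MOVING = 0
    DEFENSIVE = 1


def to_jsonable(obj: Any) -> Any:
    """Convert obj into JSON-friendly types, checking enums before ints."""
    if isinstance(obj, enum.Enum):
        # Must come first: IntEnum members are also instances of int.
        return obj.name  # assumption: serialize enums by member name
    if obj is None or isinstance(obj, (bool, int, float, str)):
        return obj
    if isinstance(obj, dict):
        return {str(k): to_jsonable(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_jsonable(v) for v in obj]
    raise TypeError(f"cannot serialize {type(obj).__name__}")


def to_jsonable_wrong_order(obj: Any) -> Any:
    """Same idea with the checks swapped, showing how the name gets lost."""
    if isinstance(obj, int):
        return int(obj)  # fires for IntEnum members too
    if isinstance(obj, enum.Enum):
        return obj.name  # never reached for IntEnum
    raise TypeError(f"cannot serialize {type(obj).__name__}")


if __name__ == "__main__":
    print(to_jsonable({"posture": Posture.DEFENSIVE}))
    print(to_jsonable_wrong_order(Posture.DEFENSIVE))
```

With the checks in the wrong order the converter still succeeds, so the symbolic name disappears without any error being raised.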

Postmortem

1. Delivered vs Planned

  • Scope: On target. All 4 sub-phases delivered as planned (MCP server, analysis tools, visualization, skills).
  • Unplanned additions: /postmortem skill created during postmortem process.
  • No items dropped or deferred.

2. Integration Audit

  • Critical fix: mcp_resources.py register_resources() was never called from _create_server() — dead code. Fixed: wired into _create_server().
  • All other modules properly imported and tested.
  • All 6 new skills listed in both CLAUDE.md and docs/skills-and-hooks.md.
  • /postmortem skill added to both locations.

3. Test Quality Review

  • 7 test files covering all source modules.
  • Integration tests verify run→query and run→compare chains.
  • Resource provider tests added during postmortem (7 tests: valid/missing for each provider).
  • Tests use fast paths (max_ticks=5, mock data) — no @pytest.mark.slow needed.

4. API Surface Check

  • Fixed: _run_single return type annotation said -> dict[str, Any] but returned tuple[dict, Any, Any]. Corrected to -> tuple[dict[str, Any], Any, Any].
  • Fixed: max_workers parameter accepted but unused (no-op). Removed from _tool_run_monte_carlo and async wrapper.
  • All public functions have type hints.
  • get_logger(__name__) used consistently.

5. Deficit Discovery

  • Fixed: set() used in _tool_compare_results violating deterministic iteration convention. Replaced with sorted(dict.fromkeys(...)).
  • Fixed: Magic numbers ([:500], [:100], max_size=20) extracted to named constants _MAX_STORE_SIZE, _MAX_STORED_EVENTS, _MAX_QUERY_EVENTS.
  • Fixed: charts.py supply threshold 0.2 hardcoded twice. Extracted to _SUPPLY_CRITICAL_THRESHOLD.
  • Minor (accepted): _run_helpers.py fragile data_dir derivation via Path(scenario_path).parent.parent. Acceptable since scenario paths always follow data/scenarios/{name}/scenario.yaml convention.
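
The set() replacement mentioned in the first fix can be illustrated as follows. The function name and the sample metric keys are hypothetical; only the sorted(dict.fromkeys(...)) idiom comes from the fix itself. The motivation is that iteration order over a set of strings can differ between interpreter runs because of hash randomization, whereas the sorted list is stable.

```python
from __future__ import annotations

from typing import Any, Mapping


def union_metric_keys(
    result_a: Mapping[str, Any], result_b: Mapping[str, Any]
) -> list[str]:
    """Return the de-duplicated keys of both results in a stable order."""
    # dict.fromkeys removes duplicates; sorted() fixes the final order.
    return sorted(dict.fromkeys([*result_a, *result_b]))


if __name__ == "__main__":
    # Hypothetical metric dictionaries from two runs.
    run_a = {"blue_losses": 3, "duration_s": 120.0}
    run_b = {"red_losses": 5, "duration_s": 95.0}
    print(union_metric_keys(run_a, run_b))
```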

6. Documentation Freshness

  • All lockstep docs updated: CLAUDE.md, project-structure.md, development-phases-post-mvp.md, devlog/index.md, README.md, MEMORY.md, skills-and-hooks.md.
  • /postmortem skill added to CLAUDE.md skill table and skills-and-hooks.md.

7. Performance Sanity

  • Phase 14 tests: 125 tests in ~2.1s. No performance regression.
  • Full suite: 4,372 tests passing.

8. Summary

  • Scope: On target
  • Quality: High — all critical issues found and fixed
  • Integration: Fully wired (after postmortem fix)
  • Deficits: 0 new (all found items resolved in-phase)
  • Action items: None — all issues resolved