# Phase 14: Tooling & Developer Experience

## Overview

Phase 14 adds developer tooling: a Claude Code MCP server, analysis utilities, visualization tools, and 6 new Claude skills. All new code lives in `stochastic_warfare/tools/` — purely additive, no modifications to existing simulation code.

Test count: 125 new tests (4,372 total passing)
## Sub-phases

### 14a: MCP Server (36 tests)

- `tools/__init__.py` — Package init
- `tools/serializers.py` — JSON serialization for numpy, datetime, enum, Position, inf/nan, dataclasses, pydantic models
- `tools/result_store.py` — LRU cache (max 20) for run results with `store`/`get`/`latest`/`list_runs`/`clear`
- `tools/mcp_server.py` — FastMCP server with 7 tools: `run_scenario`, `query_state`, `run_monte_carlo`, `compare_results`, `list_scenarios`, `list_units`, `modify_parameter`
- `tools/mcp_resources.py` — 3 resource providers: `scenario://{name}/config`, `unit://{category}/{type}`, `result://{run_id}`
Key decisions:

- `asyncio.to_thread()` for blocking simulation calls
- All tools return JSON; errors return `{"error": true, "error_type": "...", "message": "..."}`
- `mcp[cli]>=1.2.0` as optional dependency (`uv sync --extra mcp`)
- Console script entry point: `stochastic-warfare-mcp`
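The first two decisions compose naturally in a tool wrapper: offload the blocking call, then normalize any failure into the documented error envelope. This is a sketch only; `run_scenario_blocking` is a stand-in for the real simulation entry point.

```python
import asyncio
import json


def run_scenario_blocking(name: str) -> dict:
    """Stand-in for the blocking simulation call."""
    if name == "missing":
        raise FileNotFoundError(f"scenario not found: {name}")
    return {"scenario": name, "ticks": 100}


async def run_scenario_tool(name: str) -> str:
    """Offload the blocking call to a thread and always return JSON."""
    try:
        result = await asyncio.to_thread(run_scenario_blocking, name)
        return json.dumps(result)
    except Exception as exc:  # tools should never raise to the MCP client
        return json.dumps({
            "error": True,
            "error_type": type(exc).__name__,
            "message": str(exc),
        })
```

Keeping errors inside the JSON payload means the MCP client always gets a parseable response, whatever goes wrong in the simulation.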
### 14b: Analysis Tools (63 tests)

- `tools/narrative.py` — Registry-based template system with ~15 built-in formatters for event types. `generate_narrative()` groups events by tick; `format_narrative()` supports full/summary/timeline styles.
- `tools/tempo_analysis.py` — FFT spectral analysis of event frequency by 5 categories (Combat, Detection, C2, Morale, Movement). OODA cycle timing extraction from `OODAPhaseChangeEvent` sequences. 3-panel plot (time series, FFT spectrum, OODA boxplot).
- `tools/comparison.py` — A/B statistical comparison using the Mann-Whitney U test with rank-biserial effect size. `compare_distributions()` for direct use, `run_comparison()` for full scenario-based comparison.
- `tools/sensitivity.py` — Parameter sweep over calibration overrides. Same seed sequence at every point. Errorbar plot output.
- `tools/_run_helpers.py` — Shared batch runner used by the comparison and sensitivity modules. Temp YAML pattern from `CampaignRunner`.
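The spectral idea behind the tempo analysis can be shown in miniature: bin events into a per-tick count series and read the dominant cycle off the FFT. `dominant_period` below is a hypothetical helper for illustration, not the module's API.

```python
import numpy as np


def dominant_period(event_ticks: list[int], n_ticks: int) -> float:
    """Return the dominant cycle length (in ticks) of an event stream.

    Bins events into per-tick counts, takes the real FFT, and reads off
    the frequency with the largest non-DC magnitude.
    """
    counts = np.bincount(np.asarray(event_ticks), minlength=n_ticks).astype(float)
    counts -= counts.mean()                  # remove the DC component
    spectrum = np.abs(np.fft.rfft(counts))
    freqs = np.fft.rfftfreq(n_ticks, d=1.0)  # cycles per tick
    peak = spectrum[1:].argmax() + 1         # skip the zero-frequency bin
    return 1.0 / freqs[peak]
```

For example, an event firing every 8 ticks over a 64-tick run yields a dominant period of 8 ticks.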
### 14c: Visualization (26 tests)

- `tools/charts.py` — 6 chart functions: `force_strength_chart`, `engagement_network`, `supply_flow_diagram`, `engagement_timeline`, `morale_progression`, `mc_distribution_grid`. All return `matplotlib.figure.Figure`; no `plt.show()` calls.
- `tools/replay.py` — Animated battle replay via `FuncAnimation`. `extract_replay_frames()` from snapshot data, `create_replay()` with side-colored scatter plots and engagement lines, `save_replay()` to GIF/MP4.

Key decisions:

- `matplotlib.use("Agg")` at module level to avoid Tk backend issues on headless/Windows machines
- `networkx` graph for engagement network visualization (already a dependency)
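A minimal sketch of the chart-module convention: select the Agg backend before pyplot is imported, build a figure, and return it without showing it. The signature here is simplified; the real `force_strength_chart` likely takes richer inputs.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; must be set before pyplot is imported

import matplotlib.pyplot as plt
from matplotlib.figure import Figure


def force_strength_chart(strengths: dict[str, list[float]]) -> Figure:
    """Plot per-side strength over ticks; the caller decides how to save it."""
    fig, ax = plt.subplots(figsize=(8, 4))
    for side, series in sorted(strengths.items()):  # deterministic legend order
        ax.plot(range(len(series)), series, label=side)
    ax.set_xlabel("tick")
    ax.set_ylabel("strength")
    ax.legend()
    return fig  # no plt.show(): works on headless CI and in batch scripts
```

Returning the `Figure` keeps the functions testable (inspect axes and lines directly) and lets callers choose `savefig` targets.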
### 14d: Claude Skills (no tests)

6 new skill files in `.claude/skills/`:
- /scenario — Interactive scenario creation/editing walkthrough
- /compare — Run two configs and summarize with statistical interpretation
- /what-if — Quick parameter sensitivity from natural language questions
- /timeline — Generate narrative from simulation run
- /orbat — Interactive order of battle builder
- /calibrate — Auto-tune calibration overrides to match historical metrics
## pyproject.toml Changes

- Added `mcp = ["mcp[cli]>=1.2.0"]` optional dependency group
- Added `stochastic-warfare-mcp` console script entry point
## Files Created

| File | Lines | Purpose |
|---|---|---|
| `tools/__init__.py` | 1 | Package init |
| `tools/serializers.py` | 92 | JSON serialization |
| `tools/result_store.py` | 80 | LRU result cache |
| `tools/mcp_server.py` | 310 | MCP server + 7 tools |
| `tools/mcp_resources.py` | 72 | MCP resource providers |
| `tools/narrative.py` | 240 | Battle narrative generation |
| `tools/tempo_analysis.py` | 270 | FFT tempo analysis |
| `tools/comparison.py` | 145 | A/B statistical comparison |
| `tools/sensitivity.py` | 130 | Parameter sweep |
| `tools/_run_helpers.py` | 165 | Shared batch runner |
| `tools/charts.py` | 230 | 6 chart functions |
| `tools/replay.py` | 220 | Animated replay |
| 7 skill SKILL.md files | ~150 each | Claude skill templates |
| 7 test files | 125 tests | Full test coverage |
## Lessons Learned

- IntEnum vs Enum serialization: `IntEnum` subclasses `int`, so the `isinstance(obj, int)` check fires before `isinstance(obj, enum.Enum)`. Must check enum first.
- matplotlib Agg backend: On Windows without a proper Tk installation, `matplotlib.pyplot` fails to create figures. Setting `matplotlib.use("Agg")` at module level avoids the issue.
- Mann-Whitney U with identical values: `scipy.stats.mannwhitneyu` raises `ValueError` when all values are identical. Must catch this and return p=1.0.
- Temp YAML pattern reuse: The `CampaignRunner` pattern of writing temp YAML for `ScenarioLoader` works well for parameter sweeps and comparisons.
- No simulation code modified: Phase 14 is purely additive — all 4,247 existing tests continue to pass unchanged.
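The IntEnum pitfall is easy to reproduce with a custom serializer hook. The functions below are illustrative, and emitting the member name (rather than its value) is an assumption here; the point is purely the check order.

```python
import enum


class Side(enum.IntEnum):
    BLUE = 1
    RED = 2


def to_jsonable_buggy(obj: object) -> object:
    """Order bug: IntEnum members are ints, so the int branch wins."""
    if isinstance(obj, int):        # fires for IntEnum too -> name is lost
        return int(obj)
    if isinstance(obj, enum.Enum):  # unreachable for IntEnum members
        return obj.name
    return str(obj)


def to_jsonable(obj: object) -> object:
    """Check enum.Enum before int so IntEnum keeps its symbolic name."""
    if isinstance(obj, enum.Enum):  # must come first: IntEnum subclasses int
        return obj.name
    if isinstance(obj, int):
        return obj
    return str(obj)
```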
## Postmortem

### 1. Delivered vs Planned

- Scope: On target. All 4 sub-phases delivered as planned (MCP server, analysis tools, visualization, skills).
- Unplanned additions: `/postmortem` skill created during the postmortem process.
- No items dropped or deferred.
### 2. Integration Audit

- Critical fix: `mcp_resources.py`'s `register_resources()` was never called from `_create_server()` — dead code. Fixed: wired into `_create_server()`.
- All other modules properly imported and tested.
- All 6 new skills listed in both CLAUDE.md and `docs/skills-and-hooks.md`; `/postmortem` skill added to both locations.
### 3. Test Quality Review

- 7 test files covering all source modules.
- Integration tests verify run→query and run→compare chains.
- Resource provider tests added during the postmortem (7 tests: valid/missing for each provider).
- Tests use fast paths (max_ticks=5, mock data) — no `@pytest.mark.slow` needed.
### 4. API Surface Check

- Fixed: `_run_single`'s return type annotation said `-> dict[str, Any]` but the function returned `tuple[dict, Any, Any]`. Corrected to `-> tuple[dict[str, Any], Any, Any]`.
- Fixed: `max_workers` parameter accepted but unused (a no-op). Removed from `_tool_run_monte_carlo` and its async wrapper.
- All public functions have type hints. `get_logger(__name__)` used consistently.
### 5. Deficit Discovery

- Fixed: `set()` used in `_tool_compare_results`, violating the deterministic iteration convention. Replaced with `sorted(dict.fromkeys(...))`.
- Fixed: Magic numbers (`[:500]`, `[:100]`, `max_size=20`) extracted to named constants `_MAX_STORE_SIZE`, `_MAX_STORED_EVENTS`, `_MAX_QUERY_EVENTS`.
- Fixed: `charts.py` supply threshold `0.2` hardcoded twice. Extracted to `_SUPPLY_CRITICAL_THRESHOLD`.
- Minor (accepted): `_run_helpers.py` has a fragile `data_dir` derivation via `Path(scenario_path).parent.parent`. Acceptable since scenario paths always follow the `data/scenarios/{name}/scenario.yaml` convention.
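The `set()` replacement works because `dict` preserves insertion order (Python 3.7+), so deduplicate-then-sort is fully deterministic. A small illustration with made-up run IDs:

```python
run_ids = ["bravo", "alpha", "bravo", "charlie", "alpha"]

# set(run_ids) would deduplicate, but its iteration order depends on hashing.
# dict.fromkeys preserves first-seen order while dropping duplicates:
unique_in_order = list(dict.fromkeys(run_ids))  # ['bravo', 'alpha', 'charlie']

# Sorting on top removes any dependence on arrival order entirely:
unique_sorted = sorted(dict.fromkeys(run_ids))  # ['alpha', 'bravo', 'charlie']
```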
### 6. Documentation Freshness

- All lockstep docs updated: CLAUDE.md, project-structure.md, development-phases-post-mvp.md, devlog/index.md, README.md, MEMORY.md, skills-and-hooks.md.
- `/postmortem` skill added to the CLAUDE.md skill table and skills-and-hooks.md.
### 7. Performance Sanity

- Phase 14 tests: 125 tests in ~2.1s. No performance regression.
- Full suite: 4,372 tests passing.
### 8. Summary
- Scope: On target
- Quality: High — all critical issues found and fixed
- Integration: Fully wired (after postmortem fix)
- Deficits: 0 new (all found items resolved in-phase)
- Action items: None — all issues resolved