SAFe Feature Spec System — Richard Thomchick

What it does

Accepts raw stakeholder feature requests and produces polished SAFe feature specifications through a six-agent pipeline. The system handles three feature types (CAPABILITY, EXPERIENCE, WEBPAGE) with type-specific rubric interpretation, scores output on a 100-point scale across 9 sections, and iteratively improves weak sections until quality targets are met.

Router (Haiku) — Classifies feature requests by type. Fast and cheap for a decision that gates the rest of the pipeline.

Draft Answerer (Sonnet) — Proposes structured answers from PM notes, extracting what’s explicitly stated without inventing.

Generator (Sonnet) — Produces the full SAFe Feature spec. Consumes ~67% of pipeline cost — the most expensive but most value-creating agent.

Reviewer (Sonnet) — Scores the spec on a 100-point rubric across 9 sections. PM boost inputs (domain context) are injected here, post-generation but pre-improvement.

Improver (Sonnet) — Rewrites only the sections scoring below 75%. Uses Parse → Edit → Reassemble architecture to prevent structural corruption.

Polish (Sonnet) — Auto-triggers when the overall score lands in the 80-89 range, pushing specs past the 90-point threshold.

Architecture decisions

The linear pipeline (not hub-and-spoke) reflects the actual dependency chain: you can’t review what hasn’t been generated, and you can’t improve what hasn’t been reviewed. Each agent’s output is the next agent’s input.

Two conditional loops define the quality system: the Reviewer → Improver loop fires for sections below 75%, and the Improver → Polish loop auto-triggers at 80-89 to break through score ceilings. Without the two-tier approach, specs reliably plateau in the mid-80s.

Section-isolated re-scoring prevents score drift — when the Improver rewrites Section 3, the Reviewer carries forward original scores for unchanged sections rather than re-evaluating the entire spec. This eliminates a class of bugs where untouched sections mysteriously change scores between runs.

The v3 upgrade (Week 11) added PostgreSQL migration for concurrent multi-surface access, a ConnectorInterface abstraction for swapping between local SQLite and Supabase backends, and full governance module integration (cost guardrails, grounding checks, audit trail, prompt governance) running inline during normal pipeline execution.

What I learned

The biggest lesson: you can’t improve what you can’t measure. Building the golden test set and evaluation infrastructure (Week 9) before attempting improvements meant every prompt change had a clear before/after comparison. The ±7-point LLM-as-judge variance means single runs are unreliable — you need multiple runs across the full golden set to trust a comparison.

The Parse → Edit → Reassemble pattern was born from debugging structural corruption. Positional string surgery (find section header, splice in new content) causes downstream section corruption when section lengths change. Parsing the spec into a section dictionary, editing individual entries, and reassembling eliminates the entire class of bugs.