Skip to main content
← Projects

SAFe Feature Spec System

live

Six-agent pipeline that classifies, drafts, generates, reviews, improves, and polishes SAFe feature specifications with a 100-point rubric scoring system.

Problem solved
The original Feature Spec Generator worked but couldn't handle different feature types, improve its own output, or prove its quality. This system adds classification, self-evaluation, iterative improvement, and production governance.
Architecture
Router → Draft Answerer → Generator → Reviewer → Improver → Polish (with conditional improvement loops)
Tech stack
PythonStreamlitAnthropic APIPostgreSQLSupabase
Week built
Week 8

What it does

Accepts raw stakeholder feature requests and produces polished SAFe feature specifications through a six-agent pipeline. The system handles three feature types (CAPABILITY, EXPERIENCE, WEBPAGE) with type-specific rubric interpretation, scores output on a 100-point scale across 9 sections, and iteratively improves weak sections until quality targets are met.

Router (Haiku) — Classifies feature requests by type. Fast and cheap for a decision that gates the rest of the pipeline.

Draft Answerer (Sonnet) — Proposes structured answers from PM notes, extracting what’s explicitly stated without inventing.

Generator (Sonnet) — Produces the full SAFe Feature spec. Consumes ~67% of pipeline cost — the most expensive but most value-creating agent.

Reviewer (Sonnet) — Scores the spec on a 100-point rubric across 9 sections. PM boost inputs (domain context) are injected here, post-generation but pre-improvement.

Improver (Sonnet) — Rewrites only the sections scoring below 75%. Uses Parse → Edit → Reassemble architecture to prevent structural corruption.

Polish (Sonnet) — Auto-triggers when the overall score lands in the 80-89 range, pushing specs past the 90-point threshold.

Architecture decisions

The linear pipeline (not hub-and-spoke) reflects the actual dependency chain: you can’t review what hasn’t been generated, and you can’t improve what hasn’t been reviewed. Each agent’s output is the next agent’s input.

Two conditional loops define the quality system: the Reviewer → Improver loop fires for sections below 75%, and the Improver → Polish loop auto-triggers at 80-89 to break through score ceilings. Without the two-tier approach, specs reliably plateau in the mid-80s.

Section-isolated re-scoring prevents score drift — when the Improver rewrites Section 3, the Reviewer carries forward original scores for unchanged sections rather than re-evaluating the entire spec. This eliminates a class of bugs where untouched sections mysteriously change scores between runs.

The v3 upgrade (Week 11) added PostgreSQL migration for concurrent multi-surface access, a ConnectorInterface abstraction for swapping between local SQLite and Supabase backends, and full governance module integration (cost guardrails, grounding checks, audit trail, prompt governance) running inline during normal pipeline execution.

What I learned

The biggest lesson: you can’t improve what you can’t measure. Building the golden test set and evaluation infrastructure (Week 9) before attempting improvements meant every prompt change had a clear before/after comparison. The ±7-point LLM-as-judge variance means single runs are unreliable — you need multiple runs across the full golden set to trust a comparison.

The Parse → Edit → Reassemble pattern was born from debugging structural corruption. Positional string surgery (find section header, splice in new content) causes downstream section corruption when section lengths change. Parsing the spec into a section dictionary, editing individual entries, and reassembling eliminates the entire class of bugs.