Skip to content

Spec-driven development in CI

The agent is at its best when a written specification— not a vague comment— is the source of truth. You describe what to build and how it will be judged; the agent implements against that spec, and a second, independent agent reviews the result against the same spec. The spec is the contract both sides are held to.

flowchart LR
  S["Spec"]
  A["Agent<br/>(writes)"]
  R["Advisor<br/>(reviews)"]
  M["PR / MR"]
  S --> A --> M --> R
  S --> R
  R -->|gaps| A
  R -->|meets spec| H["Human merges"]

Why a spec

  • Reviewable intent. A spec is diffable and versioned— changes to what we want show up in history next to changes to what we built.
  • An objective bar. The Advisor reviews against acceptance criteria you wrote, not against its own taste— so "looks fine to me" becomes "meets / misses spec item 3".
  • Separation of duties. The Agent that writes the code never decides whether it passes— a different run (ideally a different model or vendor) judges it.

1. Write the spec

Keep specs in the repo so they are versioned, reviewed, and reachable by the agent. A simple convention is one file per feature under spec/. The spec can be authored by hand— or generated from a Jira issue's acceptance criteria when Jira is your tracker (see Triggering from Jira below).

The runnable GitLab example in examples/gitlab/claude-ci-agent-test/ ships this spec/feature01.md — a small, self-contained feature whose criteria are concrete enough for the Advisor to grade pass/fail:

# spec/feature01.md — IP Config Scraper

## Goal
A single-file Python script that fetches IP / network config from a URL
(e.g. https://httpbin.org/ip) and pretty-prints the result.

## Acceptance criteria
- [ ] A single script fetches a target URL, defaulting to a reliable JSON IP API.
- [ ] The target URL is overridable via a command-line argument.
- [ ] JSON responses are pretty-printed; non-JSON responses are printed as text.
- [ ] Timeouts, network failures, and invalid JSON exit non-zero with a clear
      error — no traceback.
- [ ] Dependencies are declared inline via PEP 723 so `uv run` needs no install.
- [ ] A unit test covers a success and a failure path, with the network mocked.

## Out of scope
- Persisting results; concurrent/async fetching; any web server.

## Constraints
- Follow CLAUDE.MD coding standards. No third-party dependencies beyond `httpx`.

Acceptance criteria are the review rubric

Write them as checkboxes the Advisor can tick off one by one. Vague goals give vague reviews; testable criteria give a pass/fail verdict.

2. Implement from the spec (Agent personality)

Trigger the read-write Agent and point its task at the spec file rather than describing the work inline. This is exactly what the example's .gitlab-ci.yml does — the prompt names the spec, and a rules: override gates the agent behind a manual click so it only runs when you ask:

include:
  - component: $CI_SERVER_FQDN/<group>/claude-ci-agent/claude-agent@v0.1.0-alpha.13
    inputs:
      prompt: >-
        Implement the specification in spec/feature01.md exactly. Satisfy every
        acceptance criterion, add the tests it requires, and follow CLAUDE.MD.
        Do not implement anything listed under "Out of scope".
      model: "claude-sonnet-4-6"

# Gate the implementer behind a manual click (omit to run it automatically).
claude-agent:
  rules:
    - when: manual
      allow_failure: false

The Agent runs claude inside a fresh rootless-Podman sandbox off the pinned image (only the working tree is mounted in), makes atomic commits, and opens a new branch / MR— it never pushes to the default branch (see Personalities & triggers). The same prompt input works for the GitHub Action if GitHub is your SCM.

3. Review against the spec (Advisor personality)

The component already ships a claude-advisor that auto-runs on every merge request. When you want to grade specifically against the spec — and on a cheaper model than the implementer — define a custom advisor job. The example does exactly this: it extends: .claude-base (the component's hidden template that resolves secrets and starts the OTel sidecar) and supplies a spec-graded prompt:

claude-agent-advisor:
  extends: .claude-base          # inherits secret setup + OTel sidecar
  variables:
    CLAUDE_MODEL: "claude-haiku-4-5"   # cheaper/faster reviewer than the agent
  script:
    - |
      claude -p "You are the ADVISOR (read-only). Review this change \
      against spec/feature01.md. For EACH acceptance criterion, state PASS or FAIL \
      with file:line evidence. Run the tests and linters. List any criterion not \
      met, any out-of-scope work, and any CLAUDE.MD violations. Write the verdict \
      to review.md. You MUST NOT modify, commit, or push any files." \
        --model "$CLAUDE_MODEL" --dangerously-skip-permissions \
        --output-format json > claude-result.json
    - test -f review.md || echo "Advisor produced no review.md." > review.md
    - cat review.md
  artifacts: { when: always, paths: [review.md, claude-result.json] }
  # Auto-run only on the agent's own merge requests (branch `claude/task-<id>`),
  # so the spec-graded review fires on agent MRs but not human-authored ones.
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event" && $CI_MERGE_REQUEST_SOURCE_BRANCH_NAME =~ /^claude\/task-/'

$[[ inputs.* ]] only interpolates inside the component

Component-input interpolation (e.g. $[[ inputs.claude_args ]]) works only within the component template itself, not in your consuming .gitlab-ci.yml. Used here it would reach the CLI verbatim and fail the job — pass a literal flag instead. This is the one gotcha the example exists to demonstrate.

Because the Advisor holds no write token, it cannot "fix and approve" its own finding— it can only report. The verdict lands in review.md (artifact) and, for the built-in claude-advisor, as an MR note; the per-run cost lands in telemetry.

Independent reviewer

Run the Advisor on a different model or vendor than the Agent so the reviewer doesn't share the implementer's blind spots. See Different models for creation vs review.

4. Close the loop

The review feeds back into the same spec-driven cycle:

  1. Advisor reports FAIL on a criterion → comment @claude address review.md findings for spec/feature01.md to re-trigger the Agent on the existing branch.
  2. Spec was wrong, not the code → edit spec/feature01.md, and both the next implementation and the next review track the change automatically.
  3. All criteria PASS → a human merges. The agent never self-approves a merge.
flowchart LR
  W["Write / refine spec"] --> I["Agent implements"]
  I --> V["Advisor grades vs spec"]
  V -->|FAIL| I
  V -->|"spec wrong"| W
  V -->|PASS| H["Human merges"]

Example configurations: simple → advanced → full loop

Three ready-to-copy GitLab configs, each a superset of the one before. All drive the same in-repo spec/feature01.md through the claude-agent component — pick the tier that matches how much automation you want.

Required CI/CD variables (all tiers; Settings → CI/CD → Variables, mask + protect): ANTHROPIC_API_KEY, and a GITLAB_TOKEN with the write_repository and api scopes and at least the Developer role (the agent pushes the branch and opens the MR; the advisor posts the note). A read-only token gets a 403 on push. Optionally set ELASTIC_OTLP_ENDPOINT / ELASTIC_OTLP_AUTHORIZATION to stream the per-run cost and secret-scrubbed audit trail to Elastic.

Simple — one include:

The component ships both personalities, so a single include is the whole pipeline: claude-advisor auto-runs on every merge request; claude-agent runs whenever the prompt (or $CLAUDE_TASK) is non-empty.

stages:
  - test

include:
  - component: $CI_SERVER_FQDN/<group>/claude-ci-agent/claude-agent@v0.1.0-alpha.13
    inputs:
      # The AGENT implements this spec on a new branch + MR. Runs only when this
      # is non-empty (or a CLAUDE_TASK pipeline variable is supplied ad-hoc).
      prompt: >-
        Implement the specification in spec/feature01.md exactly. Satisfy every
        acceptance criterion, add the tests it requires, and follow CLAUDE.MD.
        Do not implement anything listed under "Out of scope".
      # Independent reviewer: grade with a different model than the agent writes with.
      model: "claude-sonnet-4-6"

What you get from the one include:

Job Runs when Does
claude-agent prompt (or $CLAUDE_TASK) is non-empty Implements spec/feature01.md, commits, opens a new MR
claude-advisor the resulting merge request opens / updates Grades the diff against the spec, posts the verdict as an MR note

Advanced — manual gate + spec-graded reviewer

The runnable examples/gitlab/claude-ci-agent-test/ builds on the simple include with two changes: it gates the implementer behind a manual click, and adds a custom claude-agent-advisor that grades against the spec on a cheaper model. The advisor extends: .claude-base, so it inherits secret resolution and the OTel sidecar from the component:

# …the same include: as above, then:

claude-agent:                      # gate the implementer behind a click
  rules:
    - when: manual
      allow_failure: false

claude-agent-advisor:              # spec-graded reviewer on a cheaper model
  extends: .claude-base
  variables:
    CLAUDE_MODEL: "claude-haiku-4-5"
  script:
    - |
      claude -p "ADVISOR (read-only): grade this change against \
      spec/feature01.md — PASS/FAIL per criterion with file:line evidence; run the \
      tests; write the verdict to review.md; do NOT modify, commit, or push." \
        --model "$CLAUDE_MODEL" --dangerously-skip-permissions \
        --output-format json > claude-result.json
    - test -f review.md || echo "Advisor produced no review.md." > review.md
    - cat review.md
  artifacts:
    when: always
    paths: [review.md, claude-result.json]
  # Auto-review only the agent's own MRs — its branches are `claude/task-<id>`.
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event" && $CI_MERGE_REQUEST_SOURCE_BRANCH_NAME =~ /^claude\/task-/'

Clone the example, set the two variables, and click claude-agent. It implements the spec and opens an MR from a claude/task-<id> branch — which is exactly what the advisor's rule matches, so the spec-graded review runs automatically on that MR (and stays off human-authored MRs). See its README to run it. (Note the literal --dangerously-skip-permissions flag rather than $[[ inputs.claude_args ]] — see the interpolation warning in step 3.)

Branch prefix and the advisor rule are coupled

The claude/task- prefix is the component's default branch_prefix input. If you override it (e.g. inputs.branch_prefix: "ai/"), update the advisor's =~ /^claude\/task-/ regex to match, or the review will stop firing on the agent's MRs.

Full loop — tracker-driven, closes itself

The most automated tier: the Jira issue is the spec, its status drives the cycle, and the Advisor's verdict flows back onto the ticket — implement → review → re-trigger on FAIL → human merge, with nobody hand-editing YAML per feature. See Triggering from Jira below for the full wiring.

Triggering from Jira (with GitLab)

When GitLab is your SCM and Jira is your tracker, the Jira issue is the spec— its description and acceptance criteria are the contract— and the issue's status drives the loop. Moving a ticket to an "AI-Ready" state kicks off the Agent; the Advisor's verdict flows back onto the ticket. Nothing new is needed in the agent itself— only a trigger and two small API calls.

sequenceDiagram
  participant J as Jira issue (PROJ-123)
  participant GL as GitLab pipeline
  participant AG as Agent (read-write)
  participant AD as Advisor (read-only)
  participant H as Human
  J->>GL: status → "AI-Ready"<br/>Automation: web request → pipeline trigger
  GL->>AG: run with JIRA_ISSUE_KEY=PROJ-123
  AG->>J: fetch summary + acceptance criteria (REST)
  AG->>GL: implement → branch claude/PROJ-123 → open MR
  GL->>AD: MR opened → review vs the issue's criteria
  AD->>J: comment verdict (PASS/FAIL per criterion)
  AD->>J: transition (e.g. "In Review" / "Changes Requested")
  H->>J: merge MR → Smart Commit closes the issue

1. Jira side— fire on a status change

Add a Jira Automation rule: When issue transitions to AI-Ready (or gets a claude label), Then Send web request to GitLab's pipeline-trigger API, passing the issue key as a variable:

POST https://gitlab.example.com/api/v4/projects/<PROJECT_ID>/trigger/pipeline
form-encoded:
  token = {{ GitLab trigger token }}          # store in Jira's secret, not inline
  ref   = main
  variables[JIRA_ISSUE_KEY] = {{ issue.key }}
  variables[CLAUDE_TASK]    = Implement {{ issue.key }} from its acceptance criteria

CLAUDE_TASK makes the existing Agent job rule (if: $CLAUDE_TASK) match— no pipeline change required to start.

In the Agent job, turn the Jira issue into the in-repo spec the loop already expects, then name the branch with the issue key so GitLab's Jira integration auto-links the MR to the ticket:

script:
  # Pull the issue → spec/<KEY>.md (JIRA_URL/JIRA_TOKEN from CI vars or OpenBao).
  # Jira Cloud returns the description as ADF (JSON); flatten its text nodes.
  - |
    curl -sS -H "Authorization: Bearer $JIRA_TOKEN" \
      "$JIRA_URL/rest/api/3/issue/$JIRA_ISSUE_KEY?fields=summary,description" \
      | python3 -c '
    import sys, json
    d = json.load(sys.stdin)["fields"]
    def text(node):
        if isinstance(node, dict):
            return node.get("text", "") + "".join(text(c) for c in node.get("content", []))
        return "".join(text(c) for c in node) if isinstance(node, list) else ""
    print("# %s\n\n%s" % (d["summary"], text(d.get("description") or {})))
    ' > "spec/$JIRA_ISSUE_KEY.md"
  - |
    claude -p "Implement spec/$JIRA_ISSUE_KEY.md exactly. Satisfy every acceptance \
    criterion, add the tests it requires, follow CLAUDE.MD." \
      --model "$CLAUDE_MODEL" --permission-mode bypassPermissions \
      --dangerously-skip-permissions
  # Branch + MR carry the issue key so Jira and GitLab cross-link automatically.
  - |
    git push "https://oauth2:${GIT_PUSH_TOKEN}@${CI_SERVER_HOST}/${CI_PROJECT_PATH}.git" \
      "HEAD:claude/${JIRA_ISSUE_KEY}" \
      -o merge_request.create \
      -o merge_request.title="${JIRA_ISSUE_KEY} implement from spec"

3. Report the verdict back onto the ticket

When the Advisor finishes its review (MR-open trigger), post the verdict to the issue and move it— so the loop is visible to non-engineers in Jira, not just in GitLab:

# In the Advisor job, after review.md is written:
- |
  jq -Rs '{body: {type:"doc", version:1, content:[{type:"paragraph",
    content:[{type:"text", text: .}]}]}}' review.md \
    | curl -sS -X POST -H "Authorization: Bearer $JIRA_TOKEN" \
        -H "Content-Type: application/json" \
        "$JIRA_URL/rest/api/3/issue/$JIRA_ISSUE_KEY/comment" -d @-
# …and transition, e.g. PASS → "In Review", FAIL → "Changes Requested":
- |
  curl -sS -X POST -H "Authorization: Bearer $JIRA_TOKEN" \
    -H "Content-Type: application/json" \
    "$JIRA_URL/rest/api/3/issue/$JIRA_ISSUE_KEY/transitions" \
    -d "{\"transition\":{\"id\":\"$JIRA_TRANSITION_ID\"}}"

How the loop closes through Jira

  • FAIL → the Advisor's comment lands on PROJ-123; a reviewer (or a follow-up @claude comment) re-triggers the Agent on the same branch. The ticket sits in "Changes Requested" until the next review passes.
  • Spec wrong, not the code → edit the issue's acceptance criteria and move it back to AI-Ready; the regenerated spec/PROJ-123.md tracks the change.
  • PASS → a human merges the MR. A Smart Commit (PROJ-123 #close) in the merge transitions the issue to Done— the agent never self-merges or self-closes.

Credentials stay zero-trust

The GitLab trigger token is scoped to one project; the Jira API token and GIT_PUSH_TOKEN come from CI variables or, better, the OpenBao addon at run time— nothing long-lived is baked into Jira or the pipeline. The read-only Advisor gets the Jira token to comment but no GIT_PUSH_TOKEN, so it still cannot change code.

Guardrails that make this safe

  • The spec is in the repo, so every run sees the same contract and its history.
  • Separation of duties— writer and reviewer are different runs with different tokens; only the writer can change code, only humans can merge.
  • Everything is audited. Each implement/review run streams secret-scrubbed OTLP events— and its Anthropic cost— to Elastic, so spec-driven work is fully attributable per feature, branch, and personality.
  • Bypass-permissions stays safe because the agent runs fully contained— the spec loop never lowers the sandbox boundary.