
ScaffBench 2.1: Claude Sonnet 5 vs Claude Opus 4.8 at scaffolding from a prompt

Claude Sonnet 5 enters ScaffBench 2.1, tested against Opus 4.8 (low and max effort) on 13 hard full-stack specs: can a model scaffold a project that builds, working only from a prompt?

July 1, 2026 · Ibrahim Elkamali

Tags: benchmark, claude-code, agents

ScaffBench 2 asked the hardest version of one question — can a coding agent hand-write a real full-stack project from a prompt, with no scaffolder, and have it build? The answer was "rarely."

ScaffBench 2.1 puts the newest model, Claude Sonnet 5, to that test — head-to-head with Claude Opus 4.8 at both low and maximum reasoning effort, on an expanded suite of 13 hard specs across eight ecosystems. The prompt path is the purest measure of raw capability: the agent reads the spec and writes every file itself.

Opus 4.8 has finished the full 13-spec run at both efforts. Sonnet 5's max run is still generating, so its numbers below cover only the nine specs validated so far. Here's where things stand.

Leaderboard

CORE pass@1 = the project installs, builds, type-checks, and native-compiles from a prompt. Higher is better; cost and steps are per project.

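| Config | CORE pass@1 | Avg cost | Out tokens | Tool steps | Wired libs |
|---|---|---|---|---|---|
| Claude Opus 4.8 · low | 3 / 11 (27%) | $1.63 | 21.5k | 38 | 96% |
| Claude Opus 4.8 · max | 3 / 11 (27%) | $9.21 | 148k | 99 | 97% |
| Claude Sonnet 5 · max | 1 / 8 (13%) | $4.99 | 111k | 230 | 97% |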
[Chart: CORE build pass@1, prompt-only. Opus 4.8 · low: 3 / 11 (27%); Opus 4.8 · max: 3 / 11 (27%); Sonnet 5 · max: 1 / 8 (13%)]

Sonnet 5 = 9 of 13 specs so far (run in progress). Inconclusive specs excluded — see The honest part.

Claude Sonnet 5: strong instincts, fewer builds

Sonnet 5 selects the right stack as well as Opus — 97% wired, matching it spec-for-spec — but it turns fewer of those into projects that compile: 1 of 8, versus Opus's 3 of 11. The one spec both reliably build is python-ingestion-api; the dense multi-service specs defeat both.

| Spec | Ecosystem | Opus 4.8 · max | Sonnet 5 · max |
|---|---|---|---|
| python-ingestion-api | Python | ✅ builds | ✅ builds |
| rust-leptos-axum | Rust | ✅ builds | ❌ no build |
| java-spring-jooq-keycloak | Java | ✅ builds | ❌ no build |
| ai-search-workbench | TypeScript | ❌ no build | ❌ no build |
| go-realtime-api | Go | ❌ no build | ❌ no build |
| multi-dotnet-ops | Multi (TS + .NET) | ❌ no build | ❌ no build |
| ts-svelte-edge-orpc | TypeScript | ❌ timeout | ❌ no build |
| dotnet-blazor-cqrs | .NET | ❌ no build | ❌ no build |
| multi-ts-go-grpc | Multi (TS + Go) | ⚪ inconclusive | ⚪ inconclusive |

Opus uniquely landed two builds Sonnet couldn't — rust-leptos-axum and java-spring-jooq-keycloak.
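
As a rough illustration of how the counts fall out of that table, here is a minimal Python sketch. The outcome labels (`build`, `no_build`, `timeout`, `inconclusive`) and the function names are made up for this example and are not the benchmark's real grader. It assumes what the post states: inconclusive specs are dropped from the denominator and a timeout counts as a failure.

```python
# Per-spec outcomes transcribed from the table above (max-effort runs only).
# The label strings are illustrative, not the benchmark's own vocabulary.
RESULTS = {
    "python-ingestion-api":      {"opus": "build",        "sonnet": "build"},
    "rust-leptos-axum":          {"opus": "build",        "sonnet": "no_build"},
    "java-spring-jooq-keycloak": {"opus": "build",        "sonnet": "no_build"},
    "ai-search-workbench":       {"opus": "no_build",     "sonnet": "no_build"},
    "go-realtime-api":           {"opus": "no_build",     "sonnet": "no_build"},
    "multi-dotnet-ops":          {"opus": "no_build",     "sonnet": "no_build"},
    "ts-svelte-edge-orpc":       {"opus": "timeout",      "sonnet": "no_build"},
    "dotnet-blazor-cqrs":        {"opus": "no_build",     "sonnet": "no_build"},
    "multi-ts-go-grpc":          {"opus": "inconclusive", "sonnet": "inconclusive"},
}


def pass_at_1(model):
    """Return (passed, scored): inconclusive specs are excluded, timeouts count as failures."""
    outcomes = [row[model] for row in RESULTS.values()]
    scored = [o for o in outcomes if o != "inconclusive"]
    passed = sum(o == "build" for o in scored)
    return passed, len(scored)


def builds(model):
    """Set of specs the given model built."""
    return {spec for spec, row in RESULTS.items() if row[model] == "build"}


# Specs that Opus built and Sonnet did not.
opus_only = sorted(builds("opus") - builds("sonnet"))

print("sonnet:", pass_at_1("sonnet"))
print("opus (these nine specs only):", pass_at_1("opus"))
print("opus-only builds:", opus_only)
```

Note that the Opus figure here covers only the nine specs in this table, so its denominator is smaller than the 11 scored specs on the leaderboard, which spans the full 13-spec run.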

Sonnet 5 is also the busiest run in the field. It averaged 230 tool steps per project — more than double Opus at max (99) and 6× Opus at low (38) — grinding through far more edits and retries, and still landing the fewest builds:

[Chart: Avg tool steps per project (more ≠ better). Opus 4.8 · low: 38 steps; Opus 4.8 · max: 99 steps; Sonnet 5 · max: 230 steps]

Tool-use calls per generated project, averaged over scored specs.

The read on Sonnet 5: a great sense of what to use, a weaker hand at assembling a dense stack into something coherent — and it works harder to get there.

Maximum reasoning didn't buy more builds

The obvious hypothesis after ScaffBench 2 was that the models knew the right libraries and just ran out of room to wire them — so maximum reasoning should turn the near-misses into builds. It didn't.

On the same 13 specs, Opus 4.8 scored identically at low and max, 3/11 either way, for 5–10× the cost:

[Chart: Avg cost per project ($). Opus 4.8 · low: $1.63; Sonnet 5 · max: $4.99; Opus 4.8 · max: $9.21]

Metered from token usage. Opus at max spent $13–19 on the heaviest specs — none of which built.
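
To put the averages side by side, here is a small Python sketch of the arithmetic, using only the leaderboard figures. The "cost per passing build" number is a derived estimate, not something the benchmark reports. It rests on the assumption that the average cost is taken over the scored specs, which the post states for tool steps but not explicitly for cost.

```python
# Leaderboard figures copied from the post:
# (avg cost in USD, scored specs, builds, avg tool steps).
RUNS = {
    "opus-low":   (1.63, 11, 3, 38),
    "opus-max":   (9.21, 11, 3, 99),
    "sonnet-max": (4.99, 8, 1, 230),
}

COST, SCORED, BUILT, STEPS = range(4)


def ratio(a, b, field):
    """How many times larger run `a` is than run `b` on the given field."""
    return RUNS[a][field] / RUNS[b][field]


def cost_per_build(name):
    """Estimated total spend divided by passing builds.

    Assumption: avg cost is averaged over the scored specs, so
    total spend is approximately avg_cost * scored.
    """
    avg_cost, scored, built, _ = RUNS[name]
    return avg_cost * scored / built


for name in RUNS:
    print(f"{name}: about ${cost_per_build(name):.2f} per passing build")

print(f"opus max vs low, avg cost: {ratio('opus-max', 'opus-low', COST):.2f}x")
print(f"sonnet max vs opus max, steps: {ratio('sonnet-max', 'opus-max', STEPS):.2f}x")
print(f"sonnet max vs opus low, steps: {ratio('sonnet-max', 'opus-low', STEPS):.2f}x")
```

On the averages alone, the max-to-low cost ratio sits near the low end of the 5–10× range quoted above; the upper end presumably reflects individual heavy specs.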

And the tie hides the most interesting result. Max didn't build the same three projects as low — it fixed two and broke two:

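| Spec | Opus 4.8 · low | Opus 4.8 · max |
|---|---|---|
| python-ingestion-api | ✅ builds | ✅ builds |
| rust-leptos-axum | ❌ no build | ✅ builds (fixed) |
| java-spring-jooq-keycloak | ❌ no build | ✅ builds (fixed) |
| go-realtime-api | ✅ builds | ❌ no build (broke) |
| multi-dotnet-ops | ✅ builds | ❌ no build (broke) |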

Extra deliberation traded one project's success for another's rather than lifting the total. More thinking changed which hard stacks Opus assembled correctly, not how many. The bottleneck isn't reasoning — it's coherent assembly.
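
The "fixed two, broke two" pattern is just a pair of set differences. A minimal Python sketch follows; the low-effort build set is inferred from the fixed and broke labels in the comparison above, not read from a separate results file.

```python
# Build sets for Opus 4.8. The low-effort set is inferred from the
# "fixed" / "broke" labels in the low-vs-max comparison.
low_builds = {"python-ingestion-api", "go-realtime-api", "multi-dotnet-ops"}
max_builds = {"python-ingestion-api", "rust-leptos-axum", "java-spring-jooq-keycloak"}

fixed = max_builds - low_builds    # failed at low, built at max
broke = low_builds - max_builds    # built at low, failed at max
stable = low_builds & max_builds   # built at both efforts
net_change = len(max_builds) - len(low_builds)

print("fixed:", sorted(fixed))
print("broke:", sorted(broke))
print("stable:", sorted(stable))
print("net change in builds:", net_change)
```

A net change of zero with non-empty `fixed` and `broke` sets is what distinguishes a reshuffle from a plain tie, where both sets would be empty.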

What's new in ScaffBench 2.1

  • Eight ecosystems, 13 specs. TypeScript, Rust, Python, Go, .NET, and now Java (Spring + jOOQ + Keycloak) and Elixir (Broadway + Absinthe), plus an Expo React Native app and multi-ecosystem graphs — each chosen so a confident agent can pick a near neighbour and get it subtly wrong.
  • Frontier, prompt-only specs. Two specs are deliberately beyond our own generator's option space, so they only ever run on the prompt path — no scaffolder could "cheat" them.
  • Real cost accounting. Claude Code reports $0 on a subscription, which made capable models look free. We now price every run from its token usage — which is how the max-effort cost above is even visible.
  • Validated end to end. With the .NET SDK on the grading host, the .NET and multi-.NET specs are now genuinely built rather than skipped.
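
For readers who want to reproduce the cost accounting idea, pricing a run from token usage can be sketched as below. Everything here is an assumption made for illustration: the `Usage` fields, the `run_cost` helper, the per-million-token prices, and the example token counts are all hypothetical and do not come from the benchmark or from any published price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    """Token counts for one generation run (hypothetical field names)."""
    input_tokens: int
    output_tokens: int
    cache_read_tokens: int = 0


# HYPOTHETICAL prices in USD per million tokens, chosen only to keep the
# arithmetic easy to follow. Substitute the real rates for the model under test.
PRICES = {"input": 5.00, "output": 25.00, "cache_read": 0.50}


def run_cost(usage, prices=PRICES):
    """Price a run as the sum of tokens * per-token rate for each token class."""
    per_token = {kind: usd / 1_000_000 for kind, usd in prices.items()}
    return (
        usage.input_tokens * per_token["input"]
        + usage.output_tokens * per_token["output"]
        + usage.cache_read_tokens * per_token["cache_read"]
    )


# Made-up token counts for a single project.
example = Usage(input_tokens=200_000, output_tokens=20_000, cache_read_tokens=1_000_000)
print(f"${run_cost(example):.2f}")
```

The point of pricing from usage rather than from the reported bill is that a subscription reporting $0 no longer hides how much a max-effort run actually consumes.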

The honest part

Two of the results aren't clean build verdicts, and the benchmark is only useful if we say so.

The first caveat is multi-ts-go-grpc: it is marked inconclusive for both Opus 4.8 and Sonnet 5 at max, so it is excluded from the pass@1 denominators rather than counted as a build or a failure.

The second caveat is ts-svelte-edge-orpc for Opus 4.8 at max: it was still actively reasoning when it hit the per-spec time ceiling and was cut off. Per our SWE-bench-style taxonomy a generation timeout counts as a model failure — but it's a softer failure than a clean non-build, and worth flagging.

How we score

Takeaways

  • Sonnet 5 knows the stack, but assembles fewer builds. It ties the field on library selection (~97% wired) yet trails on projects that actually compile (1/8), while doing the most work to get there (230 steps/project).
  • Reasoning effort isn't the lever. Opus scored the same 3/11 at max as at low, for 5–10× the cost; extra deliberation reshuffles which stacks assemble, not how many.
  • Stack selection is solved; coherent assembly isn't. ~97% wired at a ~13–27% build rate is the whole story in one line — and it's exactly the gap a scaffolder closes, which is why the MCP and CLI paths in ScaffBench 2 sat near 100% while prompt-only struggled.

We'll update this post with Sonnet 5's remaining specs and more models as the full runs complete.
