ScaffBench 2.1: Claude Sonnet 5 vs Claude Opus 4.8 at scaffolding from a prompt
Claude Sonnet 5 enters ScaffBench 2.1, tested against Opus 4.8 (low + max effort) on 13 hard full-stack specs: can a model scaffold a project that builds from a prompt?
July 1, 2026 · Ibrahim Elkamali
ScaffBench 2 asked the hardest version of one question — can a coding agent hand-write a real full-stack project from a prompt, with no scaffolder, and have it build? The answer was "rarely."
ScaffBench 2.1 puts the newest model, Claude Sonnet 5, to that test — head-to-head with Claude Opus 4.8 at both low and maximum reasoning effort, on an expanded suite of 13 hard specs across eight ecosystems. The prompt path is the purest measure of raw capability: the agent reads the spec and writes every file itself.
Opus 4.8 has finished the full 13-spec run at both efforts. Sonnet 5's max run is still generating, so its numbers below cover only the nine specs validated so far. Here's where things stand.
Leaderboard
CORE pass@1 = the project installs, builds, type-checks, and native-compiles from a prompt. Higher is better; cost and steps are per project.
| Config | CORE pass@1 | Avg cost | Out tokens | Tool steps | Wired libs |
|---|---|---|---|---|---|
| Claude Opus 4.8 · low | 3 / 11 (27%) | $1.63 | 21.5k | 38 | 96% |
| Claude Opus 4.8 · max | 3 / 11 (27%) | $9.21 | 148k | 99 | 97% |
| Claude Sonnet 5 · max | 1 / 8 (13%) | $4.99 | 111k | 230 | 97% |
Sonnet 5 figures cover 9 of 13 specs so far (run in progress). Inconclusive specs are excluded from the denominator; see The honest part.
Claude Sonnet 5: strong instincts, fewer builds
Sonnet 5 selects the right stack as well as Opus — 97% wired, matching it spec-for-spec — but it
turns fewer of those into projects that compile: 1 of 8, versus Opus's 3 of 11. The one spec both
reliably build is python-ingestion-api; the dense multi-service specs defeat both.
| Spec | Ecosystem | Opus 4.8 · max | Sonnet 5 · max |
|---|---|---|---|
| python-ingestion-api | Python | ✅ builds | ✅ builds |
| rust-leptos-axum | Rust | ✅ builds | ❌ no build |
| java-spring-jooq-keycloak | Java | ✅ builds | ❌ no build |
| ai-search-workbench | TypeScript | ❌ no build | ❌ no build |
| go-realtime-api | Go | ❌ no build | ❌ no build |
| multi-dotnet-ops | Multi (TS + .NET) | ❌ no build | ❌ no build |
| ts-svelte-edge-orpc | TypeScript | ❌ timeout | ❌ no build |
| dotnet-blazor-cqrs | .NET | ❌ no build | ❌ no build |
| multi-ts-go-grpc | Multi (TS + Go) | ⚪ inconclusive | ⚪ inconclusive |
Opus uniquely landed two builds Sonnet couldn't — rust-leptos-axum and java-spring-jooq-keycloak.
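As a rough illustration of how the Sonnet 5 column above turns into its leaderboard score, here is a minimal Python sketch. The verdict strings and the `core_pass_at_1` helper are our own illustrative encoding, not the benchmark's actual grader. It assumes the rule stated in this post: inconclusive specs drop out of the denominator, and any other non-build outcome (including a timeout) counts as a failure.

```python
from collections import Counter

# Verdicts transcribed from the "Sonnet 5 · max" column of the table above.
# The string labels are an illustrative encoding, not the grader's format.
SONNET_5_MAX = {
    "python-ingestion-api": "builds",
    "rust-leptos-axum": "no build",
    "java-spring-jooq-keycloak": "no build",
    "ai-search-workbench": "no build",
    "go-realtime-api": "no build",
    "multi-dotnet-ops": "no build",
    "ts-svelte-edge-orpc": "no build",
    "dotnet-blazor-cqrs": "no build",
    "multi-ts-go-grpc": "inconclusive",
}


def core_pass_at_1(verdicts):
    """Return (passed, scored) for a mapping of spec name to verdict.

    Inconclusive specs are excluded from the denominator; every other
    verdict that is not "builds" (no build, timeout) counts as a failure.
    """
    counts = Counter(verdicts.values())
    scored = sum(n for verdict, n in counts.items() if verdict != "inconclusive")
    return counts["builds"], scored


passed, scored = core_pass_at_1(SONNET_5_MAX)
print(f"CORE pass@1: {passed} / {scored} ({passed / scored:.1%})")
# prints "CORE pass@1: 1 / 8 (12.5%)"
```

The leaderboard rounds that 12.5% up to 13%.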
Sonnet 5 is also the busiest run in the field. It averaged 230 tool steps per project, more than double Opus at max (99) and 6× Opus at low (38), grinding through far more edits and retries and still landing the fewest builds. (Tool steps are tool-use calls per generated project, averaged over scored specs.)
The read on Sonnet 5: a great sense of what to use, a weaker hand at assembling a dense stack into something coherent — and it works harder to get there.
Maximum reasoning didn't buy more builds
The obvious hypothesis after ScaffBench 2 was that the models knew the right libraries and just ran out of room to wire them — so maximum reasoning should turn the near-misses into builds. It didn't.
On the same 13 specs, Opus 4.8 scored identically at low and max (3/11 either way) for 5–10× the cost. Costs are metered from token usage. Opus at max spent $13–19 on the heaviest specs, none of which built.
And the tie hides the most interesting result. Max didn't build the same three projects as low — it fixed two and broke two:
| Spec | Opus 4.8 · low | Opus 4.8 · max |
|---|---|---|
| python-ingestion-api | ✅ | ✅ |
| rust-leptos-axum | ❌ | ✅ fixed |
| java-spring-jooq-keycloak | ❌ | ✅ fixed |
| go-realtime-api | ✅ | ❌ broke |
| multi-dotnet-ops | ✅ | ❌ broke |
Extra deliberation traded one project's success for another's rather than lifting the total. More thinking changed which hard stacks Opus assembled correctly, not how many. The bottleneck isn't reasoning — it's coherent assembly.
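The fixed/broke comparison in the table above is a simple per-spec diff between two runs. A minimal Python sketch, assuming each run is recorded as a mapping from spec name to a boolean build result (the `compare_runs` helper and the data layout are illustrative, not part of the benchmark harness):

```python
# Opus 4.8 build results transcribed from the table above (True = builds).
OPUS_LOW = {
    "python-ingestion-api": True,
    "rust-leptos-axum": False,
    "java-spring-jooq-keycloak": False,
    "go-realtime-api": True,
    "multi-dotnet-ops": True,
}
OPUS_MAX = {
    "python-ingestion-api": True,
    "rust-leptos-axum": True,
    "java-spring-jooq-keycloak": True,
    "go-realtime-api": False,
    "multi-dotnet-ops": False,
}


def compare_runs(baseline, candidate):
    """Return (fixed, broke): specs that flipped between two runs.

    Only specs present in both runs are compared.
    """
    shared = sorted(baseline.keys() & candidate.keys())
    fixed = [s for s in shared if not baseline[s] and candidate[s]]
    broke = [s for s in shared if baseline[s] and not candidate[s]]
    return fixed, broke


fixed, broke = compare_runs(OPUS_LOW, OPUS_MAX)
print("fixed:", fixed)
print("broke:", broke)
print("net change in builds:", len(fixed) - len(broke))
```

A net change of zero with non-empty `fixed` and `broke` lists is the pattern described above: the total is unchanged while the set of building projects shifts.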
What's new in ScaffBench 2.1
- Eight ecosystems, 13 specs. TypeScript, Rust, Python, Go, .NET, and now Java (Spring + jOOQ + Keycloak) and Elixir (Broadway + Absinthe), plus an Expo React Native app and multi-ecosystem graphs — each chosen so a confident agent can pick a near neighbour and get it subtly wrong.
- Frontier, prompt-only specs. Two specs are deliberately beyond our own generator's option space, so they only ever run on the prompt path — no scaffolder could "cheat" them.
- Real cost accounting. Claude Code reports $0 on a subscription, which made capable models look free. We now price every run from its token usage, which is how the max-effort cost above is even visible.
- Validated end to end. With the .NET SDK on the grading host, the .NET and multi-.NET specs are now genuinely built rather than skipped.
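To make the token-based pricing in the list above concrete, here is a minimal Python sketch. The per-million-token rates are placeholder values chosen for illustration only, not the prices used in this benchmark or any published price list, and the `Rates` and `run_cost` names are hypothetical. A real meter may also need to account for token categories this sketch ignores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Price in dollars per million tokens (placeholder values, see below)."""
    input_per_mtok: float
    output_per_mtok: float


# Hypothetical rates for illustration; substitute the real price list.
PLACEHOLDER_RATES = Rates(input_per_mtok=3.0, output_per_mtok=15.0)


def run_cost(input_tokens, output_tokens, rates):
    """Price one run from its metered token counts, in dollars."""
    return (
        input_tokens * rates.input_per_mtok
        + output_tokens * rates.output_per_mtok
    ) / 1_000_000


# Example with made-up token counts for a single project.
cost = run_cost(input_tokens=400_000, output_tokens=100_000, rates=PLACEHOLDER_RATES)
print(f"${cost:.2f}")
```

Pricing from metered usage rather than the reported bill means a subscription run and a pay-per-token run of the same project come out at the same cost.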
The honest part
Two of the results aren't clean build verdicts, and the benchmark is only useful if we say so.
The second caveat is ts-svelte-edge-orpc for Opus 4.8 at max: it was still actively reasoning when
it hit the per-spec time ceiling and was cut off. Per our SWE-bench-style taxonomy a generation timeout
counts as a model failure — but it's a softer failure than a clean non-build, and worth flagging.
How we score
Takeaways
- Sonnet 5 knows the stack, but assembles fewer builds. It ties the field on library selection (~97% wired) yet trails on projects that actually compile (1/8), while doing the most work to get there (230 steps/project).
- Reasoning effort isn't the lever. Opus scored the same 3/11 at max as at low, for 5–10× the cost. Extra deliberation reshuffles which stacks assemble, not how many.
- Stack selection is solved; coherent assembly isn't. ~97% wired at a ~13–27% build rate is the whole story in one line — and it's exactly the gap a scaffolder closes, which is why the MCP and CLI paths in ScaffBench 2 sat near 100% while prompt-only struggled.
We'll update this post with Sonnet 5's remaining specs and more models as the full runs complete.